## Ordinal data encoding 

This encoding technique is used for categorical data that is ordinal, i.e. whose categories have a natural order.
For example, a customer rating has an order. In the example below, we assign an order to the weekday names and thereby convert this ordinal
feature to a numeric type that an ML algorithm can work with.

### Label Encoding-

Replace the categories with a number from 1 to n (or 0 to n-1, depending on the implementation), where n is the number of distinct categories of the variable. That is, find the unique categories present in the variable and assign a rank to each of them.
Then replace each value of the categorical feature with its respective rank.
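
As a minimal sketch of the idea, the customer-rating case could look like this. The rating names and their order are assumptions made up for illustration; `pd.Categorical` with `ordered=True` produces codes 0 to n-1 that follow the order we supply.

```python
import pandas as pd

# Hypothetical customer ratings; the category names and their order are assumptions.
ratings = pd.Series(["Good", "Poor", "Excellent", "Average", "Good"])
rating_order = ["Poor", "Average", "Good", "Excellent"]  # lowest to highest

# An ordered categorical assigns codes 0..n-1 following rating_order.
codes = pd.Categorical(ratings, categories=rating_order, ordered=True).codes

# Add 1 to obtain ranks 1..n instead of 0..n-1.
rating_rank = pd.Series(codes, index=ratings.index) + 1
print(rating_rank.tolist())
```

Passing the order explicitly matters: without it the categories would be sorted alphabetically, which is usually not the intended rank.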

In [63]:
import datetime
import calendar

import warnings
warnings.filterwarnings('ignore')

In [10]:
today_date = datetime.datetime.today()   #called without arguments, today() and now() both return the current local date and time

In [45]:
datetime.datetime.now()     #datetime is written twice because the module and the class inside it are both named datetime.

datetime.datetime(2021, 6, 27, 19, 14, 50, 129639)

In [16]:
today_date

datetime.datetime(2021, 6, 27, 18, 42, 49, 515157)

In [19]:
today_date - datetime.timedelta(2, hours = 5, minutes = 10)  #subtracts a timedelta of 2 days, 5 hours and 10 minutes from today_date

datetime.datetime(2021, 6, 25, 13, 32, 49, 515157)

In [21]:
#Building the dates of the last 15 days using a list comprehension

days = [today_date - datetime.timedelta(days = i) for i in range(0,15)]

In [29]:
import pandas as pd
data = pd.DataFrame(data = days, columns= ['Day'])   #converting to pd dataframe.
data.head()

Unnamed: 0,Day
0,2021-06-27 18:42:49.515157
1,2021-06-26 18:42:49.515157
2,2021-06-25 18:42:49.515157
3,2021-06-24 18:42:49.515157
4,2021-06-23 18:42:49.515157


In [73]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   Day     15 non-null     datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 248.0 bytes


In [76]:
data['Day'].dt.dayofyear

0     178
1     177
2     176
3     175
4     174
5     173
6     172
7     171
8     170
9     169
10    168
11    167
12    166
13    165
14    164
Name: Day, dtype: int64

In [85]:
data['Day'].dt.month.head()

0    6
1    6
2    6
3    6
4    6
Name: Day, dtype: int64

In [86]:
data['Day'].dt.day.head()

0    27
1    26
2    25
3    24
4    23
Name: Day, dtype: int64

In [82]:
data['Day_name'] = data['Day'].dt.day_name()   #day_name() returns the name of the day, e.g. Sunday.

In [83]:
data.head()


Unnamed: 0,Day,Day_name
0,2021-06-27 18:42:49.515157,Sunday
1,2021-06-26 18:42:49.515157,Saturday
2,2021-06-25 18:42:49.515157,Friday
3,2021-06-24 18:42:49.515157,Thursday
4,2021-06-23 18:42:49.515157,Wednesday


In [87]:
my_dict = {
    'Monday':1,
    'Tuesday':2,
    'Wednesday':3,
    'Thursday':4,
    'Friday':5,
    'Saturday':6,
    'Sunday':7,
}

In [91]:
data['Day_Rank'] = data['Day_name'].map(my_dict)   #this function will map day names to their ranks as defined in the above dictionary.
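
For weekdays specifically, the same ranks can probably be obtained without a hand-written dictionary. The sketch below uses a few illustrative dates (not the notebook's `data`) and relies on `dt.dayofweek`, which numbers Monday as 0 and Sunday as 6.

```python
import pandas as pd

# Illustrative dates: 2021-06-21 was a Monday and 2021-06-27 a Sunday.
days = pd.DataFrame({"Day": pd.to_datetime(["2021-06-21", "2021-06-25", "2021-06-27"])})

# dt.dayofweek numbers Monday as 0 and Sunday as 6, so add 1 for ranks 1..7.
days["Day_Rank"] = days["Day"].dt.dayofweek + 1
print(days)
```

The dictionary approach remains the more general one, since it works for any ordinal feature, not only for dates.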

In [93]:
data   #Day_name can now be dropped, since it has been converted to a numeric column. Day_Rank is used for
#further analysis.

Unnamed: 0,Day,Day_name,Day_Rank
0,2021-06-27 18:42:49.515157,Sunday,7
1,2021-06-26 18:42:49.515157,Saturday,6
2,2021-06-25 18:42:49.515157,Friday,5
3,2021-06-24 18:42:49.515157,Thursday,4
4,2021-06-23 18:42:49.515157,Wednesday,3
5,2021-06-22 18:42:49.515157,Tuesday,2
6,2021-06-21 18:42:49.515157,Monday,1
7,2021-06-20 18:42:49.515157,Sunday,7
8,2021-06-19 18:42:49.515157,Saturday,6
9,2021-06-18 18:42:49.515157,Friday,5


In [94]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15 entries, 0 to 14
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype         
---  ------    --------------  -----         
 0   Day       15 non-null     datetime64[ns]
 1   Day_name  15 non-null     object        
 2   Day_Rank  15 non-null     int64         
dtypes: datetime64[ns](1), int64(1), object(1)
memory usage: 488.0+ bytes


## Count or Frequency Encoding-


This encoding technique is used for variables with high cardinality, where one-hot encoding would expand
the feature space dramatically.
One approach, which has been used in Kaggle competitions for this scenario, is to replace each label of the
categorical variable with its count, i.e. the number of times the label appears in the dataset, or with its frequency,
i.e. the fraction (percentage) of observations that carry that label.

For example, if 10 of our 100 observations show the color blue, we would replace blue with 10 when doing count encoding, or with 0.1 when doing frequency encoding.
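
The blue example can be sketched in a few lines. Only the "10 of 100 are blue" part comes from the text above; the other colors and their counts are made up for illustration.

```python
import pandas as pd

# Toy data matching the example above: 10 of 100 observations are "blue".
# The other colors and their counts are made up for illustration.
colors = pd.Series(["blue"] * 10 + ["red"] * 60 + ["green"] * 30)

count_map = colors.value_counts().to_dict()               # category -> count
freq_map = colors.value_counts(normalize=True).to_dict()  # category -> share of rows

count_encoded = colors.map(count_map)
freq_encoded = colors.map(freq_map)
print(count_map["blue"], freq_map["blue"])
```

`value_counts(normalize=True)` divides each count by the total number of rows, which is the same as dividing the counts by the length of the column manually.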

### Count Encoding -

In [27]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)

In [28]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [29]:
data.shape

(32561, 15)

In [30]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       32561 non-null  int64 
 1   1       32561 non-null  object
 2   2       32561 non-null  int64 
 3   3       32561 non-null  object
 4   4       32561 non-null  int64 
 5   5       32561 non-null  object
 6   6       32561 non-null  object
 7   7       32561 non-null  object
 8   8       32561 non-null  object
 9   9       32561 non-null  object
 10  10      32561 non-null  int64 
 11  11      32561 non-null  int64 
 12  12      32561 non-null  int64 
 13  13      32561 non-null  object
 14  14      32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [31]:
data.dtypes[data.dtypes == 'object'].index

Int64Index([1, 3, 5, 6, 7, 8, 9, 13, 14], dtype='int64')

In [32]:
catdata = data[data.dtypes[data.dtypes == 'object'].index]

In [33]:
catdata.head()

Unnamed: 0,1,3,5,6,7,8,9,13,14
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [34]:
catdata.columns = ['Jobtype','Education','MarritalStatus','Designation','Family','Race','Gender','Country','Salary']

#Assigning names to the column/ Renaming the columns.

In [35]:
#data.columns = ['Jobtype','Education','MarritalStatus','Designation','Family','Race','Gender','Country','Salary']

In [36]:
catdata.head()  #we can see that names have now been assigned to our columns.

Unnamed: 0,Jobtype,Education,MarritalStatus,Designation,Family,Race,Gender,Country,Salary
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [37]:
len(catdata['Country'])

32561

In [38]:
catdata.columns

Index(['Jobtype', 'Education', 'MarritalStatus', 'Designation', 'Family',
       'Race', 'Gender', 'Country', 'Salary'],
      dtype='object')

In [39]:
#Finding unique categories and their counts in all the above categorical features-

for col in catdata.columns:
    print('{}:{} \n {}'.format(col,catdata[col].unique(),catdata[col].nunique()))
    print('----------------------------------------------------------')

Jobtype:[' State-gov' ' Self-emp-not-inc' ' Private' ' Federal-gov' ' Local-gov'
 ' ?' ' Self-emp-inc' ' Without-pay' ' Never-worked'] 
 9
----------------------------------------------------------
Education:[' Bachelors' ' HS-grad' ' 11th' ' Masters' ' 9th' ' Some-college'
 ' Assoc-acdm' ' Assoc-voc' ' 7th-8th' ' Doctorate' ' Prof-school'
 ' 5th-6th' ' 10th' ' 1st-4th' ' Preschool' ' 12th'] 
 16
----------------------------------------------------------
MarritalStatus:[' Never-married' ' Married-civ-spouse' ' Divorced'
 ' Married-spouse-absent' ' Separated' ' Married-AF-spouse' ' Widowed'] 
 7
----------------------------------------------------------
Designation:[' Adm-clerical' ' Exec-managerial' ' Handlers-cleaners' ' Prof-specialty'
 ' Other-service' ' Sales' ' Craft-repair' ' Transport-moving'
 ' Farming-fishing' ' Machine-op-inspct' ' Tech-support' ' ?'
 ' Protective-serv' ' Armed-Forces' ' Priv-house-serv'] 
 15
----------------------------------------------------------
Family:

In [40]:
for col in catdata.columns:
    print('{}  : {} labels'.format(col,catdata[col].nunique()))

Jobtype  : 9 labels
Education  : 16 labels
MarritalStatus  : 7 labels
Designation  : 15 labels
Family  : 6 labels
Race  : 5 labels
Gender  : 2 labels
Country  : 42 labels
Salary  : 2 labels
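
Note that the unique values printed above carry a leading space (e.g. `' State-gov'`), because the file separates fields with a comma followed by a space. One way to avoid this, sketched here on a two-row in-memory stand-in for the file (the shortened rows are an assumption for illustration), is the `skipinitialspace` option of `pd.read_csv`.

```python
import io
import pandas as pd

# Two rows in the same "comma + space" layout as adult.data (columns shortened for illustration).
raw = "39, State-gov, United-States\n50, Private, Cuba\n"

as_is = pd.read_csv(io.StringIO(raw), header=None)
trimmed = pd.read_csv(io.StringIO(raw), header=None, skipinitialspace=True)

print(repr(as_is.loc[0, 1]))    # value keeps its leading space
print(repr(trimmed.loc[0, 1]))  # leading space removed
```

Calling `.str.strip()` on each text column after loading would be another option. The rest of this notebook keeps the values as loaded.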


In [41]:
catdata['Country'].value_counts().to_dict()

#Our aim here is to replace each Country name with its value count, so that the categorical feature becomes
#numeric and can be used by our ML models.
#E.g - United-States should be replaced by the value 29170, Mexico should be replaced by 643.

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Thailand': 18,
 ' Laos': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [42]:
country_map = catdata['Country'].value_counts().to_dict()

In [43]:
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Thailand': 18,
 ' Laos': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [44]:
catdata = catdata.copy()   #work on an explicit copy so that the assignment below does not raise SettingWithCopyWarning
catdata['Country'] = catdata['Country'].map(country_map)


In [45]:
catdata.head(20)   #Country column data has been replaced

Unnamed: 0,Jobtype,Education,MarritalStatus,Designation,Family,Race,Gender,Country,Salary
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95,<=50K
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170,<=50K
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81,<=50K
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170,>50K
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170,>50K
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170,>50K
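
One thing to keep in mind when the same map is later applied to new data: `map` returns NaN for any category that is not a key of the dictionary. A minimal sketch with made-up values follows; filling unseen categories with 0 is one possible choice, not the only one.

```python
import pandas as pd

# Hypothetical training and new data; the values are made up for illustration.
train = pd.Series(["United-States", "United-States", "Mexico", "Cuba"])
new = pd.Series(["Mexico", "Peru"])  # "Peru" never appears in train

count_map = train.value_counts().to_dict()

# map() returns NaN for keys missing from the dictionary,
# so unseen categories are filled with 0 here (one possible choice).
new_encoded = new.map(count_map).fillna(0).astype(int)
print(new_encoded.tolist())
```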


### Frequency Encoding-

In [66]:
data = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data', header = None)

In [67]:
data.head(2)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K


In [68]:
catdata = data[data.dtypes[data.dtypes == 'object'].index].copy()   #copy so that later column assignments work on an independent dataframe

In [69]:
catdata.columns = ['Jobtype','Education','MarritalStatus','Designation','Family','Race','Gender','Country','Salary']

In [70]:
catdata.head()

Unnamed: 0,Jobtype,Education,MarritalStatus,Designation,Family,Race,Gender,Country,Salary
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba,<=50K


In [71]:
my_dict = catdata['Country'].value_counts().to_dict()  
my_dict

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Trinadad&Tobago': 19,
 ' Cambodia': 19,
 ' Thailand': 18,
 ' Laos': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Hungary': 13,
 ' Honduras': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [72]:
for i in my_dict:     #Replacing by frequency
    my_dict[i] = my_dict[i]/len(catdata['Country'])
my_dict

{' United-States': 0.895857006848684,
 ' Mexico': 0.019747550750898315,
 ' ?': 0.0179048555019809,
 ' Philippines': 0.006080894321427475,
 ' Germany': 0.004207487485028101,
 ' Canada': 0.00371610208531679,
 ' Puerto-Rico': 0.0035011209729430915,
 ' El-Salvador': 0.003255428273087436,
 ' India': 0.0030711587481956942,
 ' Cuba': 0.0029176008107859096,
 ' England': 0.002764042873376125,
 ' Jamaica': 0.0024876385860385123,
 ' South': 0.0024569269985565555,
 ' China': 0.002303369061146771,
 ' Italy': 0.0022419458861828567,
 ' Dominican-Republic': 0.002149811123736986,
 ' Vietnam': 0.002057676361291115,
 ' Guatemala': 0.001965541598845244,
 ' Japan': 0.0019041184238813304,
 ' Poland': 0.0018426952489174165,
 ' Columbia': 0.0018119836614354597,
 ' Taiwan': 0.001566290961579804,
 ' Haiti': 0.0013513098492061054,
 ' Iran': 0.0013205982617241485,
 ' Portugal': 0.001136328736832407,
 ' Nicaragua': 0.001044193974386536,
 ' Peru': 0.0009520592119406652,
 ' France': 0.0008906360369767513,
 ' Greece'

In [73]:
catdata['Country'] = catdata['Country'].map(my_dict)

In [74]:
catdata.head(20)   #Country column has been changed by its frequency value. This is frequency encoding.

Unnamed: 0,Jobtype,Education,MarritalStatus,Designation,Family,Race,Gender,Country,Salary
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,0.895857,<=50K
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.895857,<=50K
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,0.895857,<=50K
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0.895857,<=50K
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0.002918,<=50K
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,0.895857,<=50K
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0.002488,<=50K
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.895857,>50K
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,0.895857,>50K
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,0.895857,>50K


### Advantages of using Count/Frequency Encoding-

1. Easy to use.

2. This technique does not increase the number of features, as one-hot encoding does.

### Disadvantages of using Count/Frequency Encoding-

1. It assigns the same value to categories whose counts are equal, e.g. if Mexico and USA both occur 600 times in the country
column, the encoding won't be able to differentiate between them.
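
This collision is easy to reproduce on a small made-up column (the counts below are illustrative, not taken from the adult dataset):

```python
import pandas as pd

# Made-up column in which two different countries occur equally often.
country = pd.Series(["Mexico"] * 3 + ["USA"] * 3 + ["Cuba"] * 1)

# Count encoding: each label is replaced by its number of occurrences.
encoded = country.map(country.value_counts())

# For every encoded value, count how many distinct original labels share it.
categories_per_code = country.groupby(encoded).nunique()
print(categories_per_code)
```

Any encoded value shared by more than one label marks categories that the model can no longer tell apart.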

### Target Guided Ordinal Encoding

In [78]:
import pandas as pd

In [79]:
data = pd.read_csv('titanic_train.csv', usecols = ['Cabin','Survived'] )

In [80]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [81]:
data['Cabin'] = data['Cabin'].fillna('Missing')  #Replacing the NaN values with 'Missing'.

In [82]:
data.head()   #null values have been replaced by Missing

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [83]:
data['Cabin']=data['Cabin'].astype(str).str[0]  #Converting to str type and Extracting only first letters 

In [84]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Survived  891 non-null    int64 
 1   Cabin     891 non-null    object
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [85]:
data['Cabin'].head()

0    M
1    C
2    M
3    C
4    M
Name: Cabin, dtype: object

In [86]:
data['Cabin'].unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [87]:
data.groupby(data['Cabin'])['Survived'].mean()   #mean survival rate (mean of the target) for each cabin category.

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [88]:
data.groupby(data['Cabin'])['Survived'].mean().sort_values()  #sorting the values

#T has the lowest survival rate while D has the highest. Hence D should get the highest rank.

Cabin
T    0.000000
M    0.299854
A    0.466667
G    0.500000
C    0.593220
F    0.615385
B    0.744681
E    0.750000
D    0.757576
Name: Survived, dtype: float64

In [89]:
ordinal_labels = data.groupby(data['Cabin'])['Survived'].mean().sort_values().index  #getting index of the sorted order.
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [90]:
ordinal_labels2 = {k:i for i,k in enumerate(ordinal_labels,0)}   #enumerate yields (index, label) pairs: 'i' is the rank and 'k' is the cabin label.
ordinal_labels2

#We can see D has got the highest rank, since the labels were sorted by mean survival rate in ascending order.

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [91]:
data['Cabin'].head()

0    M
1    C
2    M
3    C
4    M
Name: Cabin, dtype: object

In [92]:
data['Cabin_ordinal_labels'] = data['Cabin'].map(ordinal_labels2)    #this will map the original categorical data to corresponding
#numeric data from the ordinal_labels2.

In [94]:
data.head(20) #we can drop the Cabin column now.

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1
5,0,M,1
6,0,E,7
7,0,M,1
8,1,M,1
9,1,M,1


In [211]:
data.drop('Cabin', axis = 1, inplace = True)

In [212]:
data.head()  #Cabin column dropped.

Unnamed: 0,Survived,Cabin_ordinal_labels
0,0,1
1,1,4
2,1,1
3,1,4
4,0,1


Here the categories are ranked by the target: the feature value with the highest survival rate is given the highest rank.
This is what we call target guided ordinal encoding.
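
The steps above can be collected into a small helper. This is only a sketch: the function name `target_guided_ordinal_map` and the toy rows are hypothetical and stand in for the Titanic data used in the notebook.

```python
import pandas as pd

def target_guided_ordinal_map(df, feature, target):
    """Rank the categories of `feature` by the mean of `target`, lowest mean first."""
    ordered = df.groupby(feature)[target].mean().sort_values().index
    return {category: rank for rank, category in enumerate(ordered)}

# Tiny made-up sample; the notebook itself uses titanic_train.csv.
toy = pd.DataFrame({
    "Cabin": ["M", "M", "M", "C", "C", "D"],
    "Survived": [0, 0, 1, 1, 0, 1],
})

cabin_rank = target_guided_ordinal_map(toy, "Cabin", "Survived")
toy["Cabin_ordinal_labels"] = toy["Cabin"].map(cabin_rank)
print(cabin_rank)
```

In this toy sample the mean survival rates are 1/3 for M, 1/2 for C and 1 for D, so M gets the lowest rank and D the highest.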

## Mean Encoding

In [218]:
data = pd.read_csv('titanic_train.csv', usecols = ['Cabin','Survived'] )

In [219]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [220]:
data.fillna('Missing',inplace = True)

In [221]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [223]:
data['Cabin'] = data['Cabin'].astype(str).str[0]
data['Cabin'].head()

0    M
1    C
2    M
3    C
4    M
Name: Cabin, dtype: object

In [224]:
data.groupby(data['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [226]:
mean_ordinal = data.groupby(data['Cabin'])['Survived'].mean().to_dict()
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [227]:
data['mean_encoded_labels'] = data['Cabin'].map(mean_ordinal)   #mapping each cabin category to its mean survival rate
#and thus converting the column to a numeric type.

In [228]:
data.head()   #we can drop the Cabin column now.

Unnamed: 0,Survived,Cabin,mean_encoded_labels
0,0,M,0.299854
1,1,C,0.59322
2,1,M,0.299854
3,1,C,0.59322
4,0,M,0.299854


In [229]:
data.drop('Cabin', axis = 1, inplace = True)

In [230]:
data.head()   #we have now handled the categorical feature of Cabin column using Mean Encoding

Unnamed: 0,Survived,mean_encoded_labels
0,0,0.299854
1,1,0.59322
2,1,0.299854
3,1,0.59322
4,0,0.299854
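
The groupby, `to_dict` and `map` steps can also be written as a single call, assuming we do not need the dictionary itself afterwards. A sketch on a tiny made-up sample:

```python
import pandas as pd

# Tiny made-up sample standing in for the Titanic data.
toy = pd.DataFrame({
    "Cabin": ["M", "M", "M", "C", "C"],
    "Survived": [0, 0, 1, 1, 0],
})

# transform('mean') returns the group mean aligned to every original row.
toy["mean_encoded_labels"] = toy.groupby("Cabin")["Survived"].transform("mean")
print(toy)
```

Keeping the explicit dictionary, as done above, is still useful when the same encoding has to be applied to another dataset later.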


### Advantages of using Mean Encoding-

1. Creates a monotonic relationship between the target variable and the feature.

### Disadvantages of using Mean Encoding-


1. This type of encoding is prone to overfitting, because the encoded values are computed from the target itself.
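
One common way to limit this, which is not part of the walkthrough above, is to learn the category means from the training rows only and then apply them to unseen rows. The sketch below uses made-up rows; falling back to the global training mean for unseen categories is an assumption, and other choices are possible.

```python
import pandas as pd

# Made-up rows; in practice these would come from a train/test split of the real data.
train = pd.DataFrame({"Cabin": ["M", "M", "C", "C"], "Survived": [0, 1, 1, 1]})
test = pd.DataFrame({"Cabin": ["M", "C", "T"]})  # "T" is absent from train

# Learn the category means from the training rows only.
mean_map = train.groupby("Cabin")["Survived"].mean().to_dict()
global_mean = train["Survived"].mean()

# Apply to test; categories unseen in train fall back to the global mean (an assumption).
test["mean_encoded_labels"] = test["Cabin"].map(mean_map).fillna(global_mean)
print(test)
```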

### Probability Ratio Encoding-

In [1]:
import pandas as pd

In [4]:
data = pd.read_csv('titanic_train.csv', usecols = ['Cabin','Survived'])

In [5]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,


In [6]:
data['Cabin'] = data['Cabin'].fillna('Missing')

In [7]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing


In [8]:
data['Cabin'] = data['Cabin'].astype(str).str[0]

In [9]:
data.head()

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M


In [10]:
data.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [12]:
prob_survived = data.groupby(['Cabin'])['Survived'].mean()
prob_survived   #This is the probability of survival for each cabin category.

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [13]:
prob_survived = pd.DataFrame(prob_survived)  #converting to a dataframe.
prob_survived

Unnamed: 0_level_0,Survived
Cabin,Unnamed: 1_level_1
A,0.466667
B,0.744681
C,0.59322
D,0.757576
E,0.75
F,0.615385
G,0.5
M,0.299854
T,0.0


In [14]:
prob_survived['Died'] = 1 - prob_survived['Survived']

In [16]:
prob_survived.head()

Unnamed: 0_level_0,Survived,Died
Cabin,Unnamed: 1_level_1,Unnamed: 2_level_1
A,0.466667,0.533333
B,0.744681,0.255319
C,0.59322,0.40678
D,0.757576,0.242424
E,0.75,0.25


In [17]:
prob_survived['Survival_Ratio'] = prob_survived['Survived']/prob_survived['Died']

In [18]:
prob_survived.head()

Unnamed: 0_level_0,Survived,Died,Survival_Ratio
Cabin,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
A,0.466667,0.533333,0.875
B,0.744681,0.255319,2.916667
C,0.59322,0.40678,1.458333
D,0.757576,0.242424,3.125
E,0.75,0.25,3.0


In [20]:
Cabin_encoded = prob_survived['Survival_Ratio'].to_dict()
Cabin_encoded

{'A': 0.875,
 'B': 2.916666666666666,
 'C': 1.4583333333333333,
 'D': 3.125,
 'E': 3.0,
 'F': 1.6000000000000003,
 'G': 1.0,
 'M': 0.42827442827442824,
 'T': 0.0}
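
The ratio computed here is P(Survived = 1) / P(Survived = 0) per category. In the data above no category has a survival probability of exactly 1, but if one did, the denominator would be 0 and the ratio would become infinite. A hedged sketch of one possible guard on made-up rows follows; the value of `epsilon` is an arbitrary choice.

```python
import pandas as pd

# Made-up sample: every passenger in cabin "B" survived, so P(died) is 0 there.
toy = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B"],
    "Survived": [1, 0, 1, 1],
})

p_survived = toy.groupby("Cabin")["Survived"].mean()
p_died = 1 - p_survived

# Clip the denominator to a small positive value (epsilon is an arbitrary choice)
# so that the ratio stays finite when P(died) is 0.
epsilon = 1e-5
ratio = p_survived / p_died.clip(lower=epsilon)
print(ratio)
```

The clipped category still receives a very large value, so it remains clearly separated from the others while staying usable by a model.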

In [21]:
data['Cabin_encoded'] = data['Cabin'].map(Cabin_encoded)

In [22]:
data.head()

Unnamed: 0,Survived,Cabin,Cabin_encoded
0,0,M,0.428274
1,1,C,1.458333
2,1,M,0.428274
3,1,C,1.458333
4,0,M,0.428274


In [23]:
data.drop('Cabin', axis = 1, inplace = True)

In [24]:
data.head()   #Cabin column has been encoded.

Unnamed: 0,Survived,Cabin_encoded
0,0,0.428274
1,1,1.458333
2,1,0.428274
3,1,1.458333
4,0,0.428274
