## Ordinal Number Encoding
- In ordinal encoding, each unique category value is assigned an integer value.
- For example, “red” is 1, “green” is 2, and “blue” is 3.
- This is called an ordinal encoding or an integer encoding and is easily reversible. Often, integer values starting at zero are used.
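As a minimal sketch of why the encoding is reversible, the colour example above can be written with two plain dictionaries. The list `colors` and the names `encode_map` and `decode_map` are made up for illustration.

```python
# Hypothetical toy data; the colour names follow the example above.
colors = ["red", "green", "blue", "green", "red"]

# Forward mapping: category -> integer (starting at 1 to match the example).
encode_map = {"red": 1, "green": 2, "blue": 3}

# Reverse mapping: integer -> category. Because every category has its own
# integer, inverting the dictionary loses no information.
decode_map = {code: label for label, code in encode_map.items()}

encoded = [encode_map[c] for c in colors]
decoded = [decode_map[i] for i in encoded]

print(encoded)
print(decoded == colors)
```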

In [10]:
import pandas as pd

In [23]:
import datetime as dt

In [24]:
today_date=dt.datetime.today()
today_date

datetime.datetime(2021, 4, 14, 7, 13, 38, 756176)

In [25]:
today_date-dt.timedelta(10)

datetime.datetime(2021, 4, 4, 7, 13, 38, 756176)

### List Comprehension

In [26]:
days=[today_date-dt.timedelta(x) for x in range(0,15)]

In [27]:
import pandas as pd
data=pd.DataFrame(days)
data.columns=["Day"]

In [28]:
data.head()

Unnamed: 0,Day
0,2021-04-14 07:13:38.756176
1,2021-04-13 07:13:38.756176
2,2021-04-12 07:13:38.756176
3,2021-04-11 07:13:38.756176
4,2021-04-10 07:13:38.756176


In [29]:
data['weekday']=data['Day'].dt.day_name()

In [31]:
data.head(6)

Unnamed: 0,Day,weekday
0,2021-04-14 07:13:38.756176,Wednesday
1,2021-04-13 07:13:38.756176,Tuesday
2,2021-04-12 07:13:38.756176,Monday
3,2021-04-11 07:13:38.756176,Sunday
4,2021-04-10 07:13:38.756176,Saturday
5,2021-04-09 07:13:38.756176,Friday


In [32]:
dictionary={'Monday':1,'Tuesday':2,'Wednesday':3,'Thursday':4,'Friday':5,'Saturday':6,'Sunday':7}

In [33]:
dictionary

{'Monday': 1,
 'Tuesday': 2,
 'Wednesday': 3,
 'Thursday': 4,
 'Friday': 5,
 'Saturday': 6,
 'Sunday': 7}

In [35]:
data['weekday_ordinal']=data['weekday'].map(dictionary)

In [39]:
data

Unnamed: 0,Day,weekday,weekday_ordinal
0,2021-04-14 07:13:38.756176,Wednesday,3
1,2021-04-13 07:13:38.756176,Tuesday,2
2,2021-04-12 07:13:38.756176,Monday,1
3,2021-04-11 07:13:38.756176,Sunday,7
4,2021-04-10 07:13:38.756176,Saturday,6
5,2021-04-09 07:13:38.756176,Friday,5
6,2021-04-08 07:13:38.756176,Thursday,4
7,2021-04-07 07:13:38.756176,Wednesday,3
8,2021-04-06 07:13:38.756176,Tuesday,2
9,2021-04-05 07:13:38.756176,Monday,1
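As a cross-check, the hand-written weekday mapping can be compared with pandas' built-in `dt.dayofweek`, which numbers Monday as 0. The sketch below assumes a fixed, made-up date range starting on Monday 2021-04-05 so that the result does not depend on the day the notebook is run. The names `week` and `weekday_to_int` are illustrative only.

```python
import pandas as pd

# Seven consecutive days starting on a Monday (2021-04-05).
week = pd.DataFrame({"Day": pd.date_range("2021-04-05", periods=7, freq="D")})

weekday_to_int = {
    "Monday": 1, "Tuesday": 2, "Wednesday": 3, "Thursday": 4,
    "Friday": 5, "Saturday": 6, "Sunday": 7,
}

# Same approach as the cells above: name the day, then map the name to an integer.
week["weekday"] = week["Day"].dt.day_name()
week["weekday_ordinal"] = week["weekday"].map(weekday_to_int)

# Built-in alternative: dt.dayofweek uses Monday=0, so add 1 to match the dictionary.
week["weekday_builtin"] = week["Day"].dt.dayofweek + 1

print(week)
```

Both columns should agree. The dictionary approach is still useful when the desired order differs from the calendar order.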


## Count or Frequency Encoding
Count (or frequency) encoding replaces each category with the number of times it occurs in the training data. When a category's frequency is related to the target variable, this gives the model a numeric signal that it can weight in direct or inverse proportion, depending on the nature of the data.

The three steps are:
1. Select the categorical variable you would like to transform.
2. Group by the categorical variable and obtain the count of each category.
3. Join the counts back to the training dataset.
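The three steps can be sketched literally with `groupby` and `merge` on a small made-up table. The frame `train` and the column name `Country_count` are assumptions for illustration and not part of the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical training data with a single categorical column.
train = pd.DataFrame({"Country": ["US", "US", "US", "Mexico", "Mexico", "Cuba"]})

# Step 1: select the categorical variable to transform.
col = "Country"

# Step 2: group by the variable and count the rows in each category.
counts = train.groupby(col).size().rename(col + "_count").reset_index()

# Step 3: join the counts back onto the training data (left join keeps row order).
encoded = train.merge(counts, on=col, how="left")

print(encoded)
```

Mapping the column through a `value_counts()` dictionary gives the same result in one step. The explicit join is shown only because it mirrors the steps one to one.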

In [40]:
train_set = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data' , header = None,index_col=None)
train_set.head() 

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [41]:
train_set[1].unique()

array([' State-gov', ' Self-emp-not-inc', ' Private', ' Federal-gov',
       ' Local-gov', ' ?', ' Self-emp-inc', ' Without-pay',
       ' Never-worked'], dtype=object)

In [42]:
len(train_set[1].unique())

9

In [43]:
columns=[1,3,5,6,7,8,9,13]

In [44]:
# .copy() avoids a SettingWithCopyWarning when the Country column is overwritten later
train_set=train_set[columns].copy()

In [45]:
train_set.columns=['Employment','Degree','Status','Designation','family_job','Race','Sex','Country']

In [47]:
train_set.head(6)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,United-States
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,United-States
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,United-States
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,United-States
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,Cuba
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,United-States


In [50]:
for feature in train_set.columns:
    print(feature,":",len(train_set[feature].unique()),'labels')

Employment : 9 labels
Degree : 16 labels
Status : 7 labels
Designation : 15 labels
family_job : 6 labels
Race : 5 labels
Sex : 2 labels
Country : 42 labels


In [57]:
country_map=train_set['Country'].value_counts().to_dict()

In [58]:
country_map

{' United-States': 29170,
 ' Mexico': 643,
 ' ?': 583,
 ' Philippines': 198,
 ' Germany': 137,
 ' Canada': 121,
 ' Puerto-Rico': 114,
 ' El-Salvador': 106,
 ' India': 100,
 ' Cuba': 95,
 ' England': 90,
 ' Jamaica': 81,
 ' South': 80,
 ' China': 75,
 ' Italy': 73,
 ' Dominican-Republic': 70,
 ' Vietnam': 67,
 ' Guatemala': 64,
 ' Japan': 62,
 ' Poland': 60,
 ' Columbia': 59,
 ' Taiwan': 51,
 ' Haiti': 44,
 ' Iran': 43,
 ' Portugal': 37,
 ' Nicaragua': 34,
 ' Peru': 31,
 ' France': 29,
 ' Greece': 29,
 ' Ecuador': 28,
 ' Ireland': 24,
 ' Hong': 20,
 ' Cambodia': 19,
 ' Trinadad&Tobago': 19,
 ' Thailand': 18,
 ' Laos': 18,
 ' Yugoslavia': 16,
 ' Outlying-US(Guam-USVI-etc)': 14,
 ' Honduras': 13,
 ' Hungary': 13,
 ' Scotland': 12,
 ' Holand-Netherlands': 1}

In [59]:
train_set['Country']=train_set['Country'].map(country_map)
train_set.head(20)

Unnamed: 0,Employment,Degree,Status,Designation,family_job,Race,Sex,Country
0,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,29170
1,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
2,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,29170
3,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,29170
4,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,95
5,Private,Masters,Married-civ-spouse,Exec-managerial,Wife,White,Female,29170
6,Private,9th,Married-spouse-absent,Other-service,Not-in-family,Black,Female,81
7,Self-emp-not-inc,HS-grad,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170
8,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Female,29170
9,Private,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,29170


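##### Advantages
1. Easy to use
2. Does not increase the feature space

##### Disadvantages
1. Categories with the same frequency receive the same value, so the model cannot tell them apart (for example, Honduras and Hungary above both map to 13)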
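The disadvantage is easy to reproduce on a small made-up series: two different categories that occur equally often end up with the same encoded value. The series `countries` is hypothetical.

```python
import pandas as pd

# Hypothetical column in which two categories occur the same number of times.
countries = pd.Series(["Honduras", "Honduras", "Hungary", "Hungary", "Scotland"])

# Frequency encoding: replace each category with its count.
freq = countries.value_counts().to_dict()
encoded = countries.map(freq)

# Honduras and Hungary both appear twice, so both are encoded as 2.
print(pd.DataFrame({"Country": countries, "encoded": encoded}))
```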

#### Target Guided Ordinal Encoding
1. Order the labels by the mean of the target within each label
2. Replace each label with its integer position in that ordering
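One way to read the two steps above, and the one the cells in this section follow, is to rank the labels by their mean target value and use the rank as the encoding. The sketch below wraps that idea in a helper on a small hand-checkable table. The function name `target_guided_ordinal` and the toy data are assumptions for illustration.

```python
import pandas as pd

def target_guided_ordinal(frame, col, target):
    """Return a dict mapping each label of `col` to its rank by mean `target`."""
    # Mean target per label, sorted from lowest to highest.
    ordered = frame.groupby(col)[target].mean().sort_values().index
    # The position in the sorted order becomes the integer code.
    return {label: rank for rank, label in enumerate(ordered)}

# Hypothetical data: mean Survived is 0.5 for A, 1.0 for B and 0.0 for C.
toy = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "C", "C"],
    "Survived": [0, 1, 1, 1, 0, 0],
})

mapping = target_guided_ordinal(toy, "Cabin", "Survived")
toy["Cabin_rank"] = toy["Cabin"].map(mapping)

print(mapping)
print(toy)
```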

In [61]:
import pandas as pd
df=pd.read_csv('titanic.csv', usecols=['Cabin','Survived'])
df.head(6)

Unnamed: 0,Survived,Cabin
0,0,
1,1,C85
2,1,
3,1,C123
4,0,
5,0,


In [62]:
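# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df['Cabin']=df['Cabin'].fillna('Missing')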

In [63]:
df.head(6)

Unnamed: 0,Survived,Cabin
0,0,Missing
1,1,C85
2,1,Missing
3,1,C123
4,0,Missing
5,0,Missing


In [64]:
df['Cabin']=df['Cabin'].astype(str).str[0]

In [66]:
df.head(6)

Unnamed: 0,Survived,Cabin
0,0,M
1,1,C
2,1,M
3,1,C
4,0,M
5,0,M


In [67]:
df.Cabin.unique()

array(['M', 'C', 'E', 'G', 'D', 'A', 'B', 'F', 'T'], dtype=object)

In [69]:
# Compute the mean of the target (Survived) for each Cabin value; ordering labels by this mean is why the method is called target guided encoding
df.groupby(['Cabin'])['Survived'].mean()

Cabin
A    0.466667
B    0.744681
C    0.593220
D    0.757576
E    0.750000
F    0.615385
G    0.500000
M    0.299854
T    0.000000
Name: Survived, dtype: float64

In [73]:
ordinal_labels=df.groupby(['Cabin'])['Survived'].mean().sort_values().index
ordinal_labels

Index(['T', 'M', 'A', 'G', 'C', 'F', 'B', 'E', 'D'], dtype='object', name='Cabin')

In [74]:
enumerate(ordinal_labels,0)

<enumerate at 0x2567dd5fd38>

In [75]:
ordinal_labels2={k:i for i,k in enumerate(ordinal_labels,0)}
ordinal_labels2

{'T': 0, 'M': 1, 'A': 2, 'G': 3, 'C': 4, 'F': 5, 'B': 6, 'E': 7, 'D': 8}

In [76]:
df['Cabin_ordinal_labels']=df['Cabin'].map(ordinal_labels2)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels
0,0,M,1
1,1,C,4
2,1,M,1
3,1,C,4
4,0,M,1


### Mean Encoding
Mean encoding is similar to label encoding, except that the labels are derived directly from the target. Each category of the feature is replaced with the mean value of the target variable for that category, computed on the training data. This places categories with similar target means close together, but the information it carries is limited to the relationship between the category and the target.
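Because the means come from the training data, a small sketch helps show how they would be applied to new rows. The frames `train` and `test` are made up, and falling back to the overall training mean for an unseen category is one common choice assumed here, not something this notebook prescribes.

```python
import pandas as pd

# Hypothetical training and test data.
train = pd.DataFrame({
    "Cabin": ["A", "A", "B", "B", "B"],
    "Survived": [1, 0, 1, 1, 0],
})
test = pd.DataFrame({"Cabin": ["A", "B", "Z"]})  # "Z" never appears in train

# Learn the per-category mean of the target on the training data only.
cat_means = train.groupby("Cabin")["Survived"].mean()

# Assumed fallback for categories not seen in training: the overall training mean.
global_mean = train["Survived"].mean()

# Apply the learned mapping; unseen categories become NaN and are then filled.
test["Cabin_mean"] = test["Cabin"].map(cat_means).fillna(global_mean)

print(test)
```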

In [77]:
mean_ordinal=df.groupby(['Cabin'])['Survived'].mean().to_dict()

In [78]:
mean_ordinal

{'A': 0.4666666666666667,
 'B': 0.7446808510638298,
 'C': 0.5932203389830508,
 'D': 0.7575757575757576,
 'E': 0.75,
 'F': 0.6153846153846154,
 'G': 0.5,
 'M': 0.29985443959243085,
 'T': 0.0}

In [79]:
df['mean_ordinal_encode']=df['Cabin'].map(mean_ordinal)
df.head()

Unnamed: 0,Survived,Cabin,Cabin_ordinal_labels,mean_ordinal_encode
0,0,M,1,0.299854
1,1,C,4,0.59322
2,1,M,1,0.299854
3,1,C,4,0.59322
4,0,M,1,0.299854


### Pros of Mean Encoding:
  - Captures information about the target within the label, which can make the feature more predictive
  - Creates a monotonic relationship between the variable and the target
### Cons of Mean Encoding:
  - It may cause over-fitting in the model.
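The over-fitting risk is largest for categories with very few rows, because their mean is estimated from almost no data. A commonly used mitigation, which is not part of this notebook and is shown here only as an assumption, is to shrink each category mean towards the overall mean in proportion to how few rows the category has. The toy data and the smoothing strength `m` are hypothetical.

```python
import pandas as pd

# Hypothetical data: "M" has four rows, "T" has a single row.
toy = pd.DataFrame({
    "Cabin": ["M", "M", "M", "M", "T"],
    "Survived": [0, 1, 0, 0, 0],
})

# Per-category mean and row count; a mean backed by one row is unreliable.
stats = toy.groupby("Cabin")["Survived"].agg(["mean", "count"])

# Overall mean of the target, used as the value to shrink towards.
prior = toy["Survived"].mean()

# Hypothetical smoothing strength: acts like m extra rows at the overall mean.
m = 2
stats["smoothed"] = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

print(stats)
```

With this weighting, the single-row category moves most of the way towards the overall mean, while the larger category changes only slightly.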