## Ordinal numbering encoding or Label Encoding

**Ordinal categorical variables**

Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories is not known.

![Example](https://miro.medium.com/max/1280/0*73R9Ll0JkBBbAqbC.png)

For example:

- Student's grade in an exam (A, B, C or Fail).
- Educational level, with the categories: Elementary school,  High school, College graduate, PhD ranked from 1 to 4.

When the categorical variables are ordinal, the most straightforward best approach is to replace the labels by some ordinal number based on the ranks.




In [15]:
import pandas as pd
import datetime

### Example 1

In [16]:
df_today = datetime.datetime.today()
df_today

datetime.datetime(2021, 6, 8, 12, 4, 16, 938974)

In [17]:
# create a variable with dates, and from that extract the weekday
# I create a list of dates with 20 days difference from today
# and then transform it into a datafame

df_date_list = [df_base - datetime.timedelta(days=x) for x in range(0, 20)]
df = pd.DataFrame(df_date_list)
df.columns = ['day']
df

Unnamed: 0,day
0,2021-06-08 11:54:27.415168
1,2021-06-07 11:54:27.415168
2,2021-06-06 11:54:27.415168
3,2021-06-05 11:54:27.415168
4,2021-06-04 11:54:27.415168
5,2021-06-03 11:54:27.415168
6,2021-06-02 11:54:27.415168
7,2021-06-01 11:54:27.415168
8,2021-05-31 11:54:27.415168
9,2021-05-30 11:54:27.415168


In [26]:
# extract the week day name

df['day_of_week'] = df['day'].dt.strftime("%A")

# df.dtypes
# type(df)

df.head(10)

Unnamed: 0,day,day_of_week
0,2021-06-08 11:54:27.415168,Tuesday
1,2021-06-07 11:54:27.415168,Monday
2,2021-06-06 11:54:27.415168,Sunday
3,2021-06-05 11:54:27.415168,Saturday
4,2021-06-04 11:54:27.415168,Friday
5,2021-06-03 11:54:27.415168,Thursday
6,2021-06-02 11:54:27.415168,Wednesday
7,2021-06-01 11:54:27.415168,Tuesday
8,2021-05-31 11:54:27.415168,Monday
9,2021-05-30 11:54:27.415168,Sunday


In [27]:
# Engineer categorical variable by ordinal number replacement

weekday_map = {'Monday':1,
               'Tuesday':2,
               'Wednesday':3,
               'Thursday':4,
               'Friday':5,
               'Saturday':6,
               'Sunday':7
}

df['day_ordinal'] = df.day_of_week.map(weekday_map)
df.head(20)

Unnamed: 0,day,day_of_week,day_ordinal
0,2021-06-08 11:54:27.415168,Tuesday,2
1,2021-06-07 11:54:27.415168,Monday,1
2,2021-06-06 11:54:27.415168,Sunday,7
3,2021-06-05 11:54:27.415168,Saturday,6
4,2021-06-04 11:54:27.415168,Friday,5
5,2021-06-03 11:54:27.415168,Thursday,4
6,2021-06-02 11:54:27.415168,Wednesday,3
7,2021-06-01 11:54:27.415168,Tuesday,2
8,2021-05-31 11:54:27.415168,Monday,1
9,2021-05-30 11:54:27.415168,Sunday,7


### Example 2

It's not a proper dataset for label encoding but , how performe label encoding ? here we do it

In [29]:
import seaborn as sns
sns.get_dataset_names()


['anagrams',
 'anscombe',
 'attention',
 'brain_networks',
 'car_crashes',
 'diamonds',
 'dots',
 'exercise',
 'flights',
 'fmri',
 'gammas',
 'geyser',
 'iris',
 'mpg',
 'penguins',
 'planets',
 'tips',
 'titanic']

In [31]:
df = sns.load_dataset("iris")

In [32]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica


In [37]:
df["species"].value_counts()

setosa        50
versicolor    50
virginica     50
Name: species, dtype: int64

In [38]:
# Import label encoder
from sklearn import preprocessing
  
# label_encoder object knows how to understand word labels.
label_encoder = preprocessing.LabelEncoder()

In [39]:
# Encode labels in column 'species'.
df['species']= label_encoder.fit_transform(df['species'])

In [42]:
df

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0
...,...,...,...,...,...
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2


In [44]:
df["species"].unique()

array([0, 1, 2])

In [46]:
df["species"].value_counts()

0    50
1    50
2    50
Name: species, dtype: int64

[sklearn.preprocessing.LabeLencoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)

fit(y) : Fit label encoder.

fit_transform(y) : Fit label encoder and return encoded labels.

get_params([deep]) : Get parameters for this estimator.

inverse_transform(y) : Transform labels back to original encoding.

set_params(**params) : Set the parameters of this estimator.

transform(y) : Transform labels to normalized encoding.

In [48]:

LE = preprocessing.LabelEncoder()
LE.fit_transform([10,10,21,21,3,3,44,54,21,43,43,10])

array([1, 1, 2, 2, 0, 0, 4, 5, 2, 3, 3, 1], dtype=int64)

In [54]:
l = [10,10,21,21,3,3,44,54,21,43,43,10]
l.sort()
l

[3, 3, 10, 10, 10, 21, 21, 21, 43, 43, 44, 54]

In [55]:

LE.fit(["paris", "paris", "tokyo", "amsterdam"])


LabelEncoder()

In [56]:
list(LE.classes_)

['amsterdam', 'paris', 'tokyo']

In [57]:
LE.fit_transform(["paris", "paris", "tokyo", "amsterdam"])

array([1, 1, 2, 0], dtype=int64)

In [60]:
list(LE.inverse_transform([2, 0, 1]))

['tokyo', 'amsterdam', 'paris']

#### Ordinal Measurement Advantages

Ordinal measurement is normally used for surveys and questionnaires. Statistical analysis is applied to the responses once they are collected to place the people who took the survey into the various categories. The data is then compared to draw inferences and conclusions about the whole surveyed population with regard to the specific variables. The advantage of using ordinal measurement is ease of collation and categorization. If you ask a survey question without providing the variables, the answers are likely to be so diverse they cannot be converted to statistics.

With Respect to Machine Learning

- Keeps the semantical information of the variable (human readable content)
- Straightforward

#### Ordinal Measurement Disadvantages
The same characteristics of ordinal measurement that create its advantages also create certain disadvantages. The responses are often so narrow in relation to the question that they create or magnify bias that is not factored into the survey. For example, on the question about satisfaction with the governor, people might be satisfied with his job performance but upset about a recent sex scandal. The survey question might lead respondents to state their dissatisfaction about the scandal, in spite of satisfaction with his job performance -- but the statistical conclusion will not differentiate.

With Respect to Machine Learning

- Does not add machine learning valuable information


[!! 11 Categorical variable Encoding !!](https://www.kaggle.com/subinium/11-categorical-encoders-and-benchmark/log)