# Ordinal Number Encoding or Label Encoding (Manually)

<b>Ordinal categorical variables</b>
Ordinal data is a categorical, statistical data type in which the variables have natural, ordered categories and the distances between the categories are not known.

For example:

- Student's grade in an exam (A, B, C or Fail).
- Educational level, with the categories Elementary school, High school, College graduate and PhD, ranked from 1 to 4.

<b>When the categorical variables are ordinal, the most straightforward approach is to replace the labels with ordinal numbers based on their ranks.</b>
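
As a minimal sketch of this idea, the exam grades from the example above can be mapped to ranks with a plain dictionary. The toy Series and the rank values are assumptions chosen for illustration; only the order of the ranks matters, not the spacing between them.

```python
import pandas as pd

# Toy data (assumed for illustration): exam grades as an ordinal variable
grades = pd.Series(['B', 'A', 'Fail', 'C', 'A'])

# Hypothetical ranking: a higher number means a better grade
grade_map = {'Fail': 1, 'C': 2, 'B': 3, 'A': 4}

# Replace each label with its rank
grade_rank = grades.map(grade_map)
print(grade_rank.tolist())  # [3, 4, 1, 2, 4]
```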

In [1]:
import pandas as pd 

In [2]:
data=pd.read_csv('adult.csv')

In [3]:
data.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       48842 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   48842 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [5]:
data.education.value_counts()

HS-grad         15784
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
Name: education, dtype: int64

In [6]:
data['education'].nunique()

16

There are 16 different categories in the education feature. First, I will group the 1st-12th grade labels into a single category, <b>School</b>, and then rank all categories as follows:

1: Preschool<br>
2: School<br>
3: HS-grad<br>
4: Some-college<br>
5: Bachelors<br>
6: Prof-school<br>
7: Assoc-acdm<br>
8: Assoc-voc<br>
9: Masters<br>
10: Doctorate<br>

In [7]:
#Labels that cover 1st through 12th grade
school_levels=['1st-4th','5th-6th','7th-8th','9th','10th','11th','12th']
school_data=data[data['education'].isin(school_levels)]

In [8]:
#Rows whose education level falls between 1st and 12th grade
school_data

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
5,34,Private,198693,10th,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
9,55,Private,104996,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K
22,72,?,132015,7th-8th,4,Divorced,?,Not-in-family,White,Female,0,0,6,United-States,<=50K
31,56,Self-emp-not-inc,186651,11th,7,Widowed,Other-service,Unmarried,White,Female,0,0,50,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48798,36,Private,131459,7th-8th,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,40,United-States,<=50K
48803,58,Private,147707,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,40,United-States,<=50K
48807,32,Private,211349,10th,6,Married-civ-spouse,Transport-moving,Husband,White,Male,0,0,40,United-States,<=50K
48816,22,Private,325033,12th,8,Never-married,Protective-serv,Own-child,Black,Male,0,0,35,United-States,<=50K


In [9]:
#Count the rows in each school-level category
school=school_data.groupby('education')['education'].agg('count').sort_values(ascending=False)
school

education
11th       1812
10th       1389
7th-8th     955
9th         756
12th        657
5th-6th     509
1st-4th     247
Name: education, dtype: int64

In [10]:
#Categorize 1st-12th as School in the original DataFrame
#(`in` on a Series tests its index, so the index of `school` is checked explicitly)
data['education']= data['education'].apply(lambda x: 'School' if x in school.index else x)
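
The same grouping can also be sketched without a helper DataFrame, by listing the school-level labels explicitly and combining `Series.isin` with `Series.mask`. The small `demo` frame below is an assumption that stands in for `data`.

```python
import pandas as pd

# Small stand-in for the real data (assumed for illustration)
demo = pd.DataFrame({'education': ['11th', 'HS-grad', '7th-8th', 'Bachelors', 'Preschool']})

# Labels that should collapse into the single category 'School'
school_levels = ['1st-4th', '5th-6th', '7th-8th', '9th', '10th', '11th', '12th']

# mask() replaces values where the condition is True and keeps the rest
is_school = demo['education'].isin(school_levels)
demo['education'] = demo['education'].mask(is_school, 'School')
print(demo['education'].tolist())
# ['School', 'HS-grad', 'School', 'Bachelors', 'Preschool']
```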

In [11]:
data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,School,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K
5,34,Private,198693,School,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K
9,55,Private,104996,School,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K


Now replace each label in education with its rank.

In [12]:
#Map ranks to their respective categories
education_map = {'Preschool': 1,
                 'School': 2,
                 'HS-grad': 3,
                 'Some-college': 4,
                 'Bachelors': 5,
                 'Prof-school': 6,
                 'Assoc-acdm': 7,
                 'Assoc-voc': 8,
                 'Masters': 9,
                 'Doctorate': 10
}

In [13]:
data['education_label'] = data['education'].map(education_map)
data.head(10)

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income,education_label
0,25,Private,226802,School,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K,2
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K,3
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K,7
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K,4
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K,4
5,34,Private,198693,School,6,Never-married,Other-service,Not-in-family,White,Male,0,0,30,United-States,<=50K,2
6,29,?,227026,HS-grad,9,Never-married,?,Unmarried,Black,Male,0,0,40,United-States,<=50K,3
7,63,Self-emp-not-inc,104626,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,White,Male,3103,0,32,United-States,>50K,6
8,24,Private,369667,Some-college,10,Never-married,Other-service,Unmarried,White,Female,0,0,40,United-States,<=50K,4
9,55,Private,104996,School,4,Married-civ-spouse,Craft-repair,Husband,White,Male,0,0,10,United-States,<=50K,2
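
`Series.map` with a dictionary returns `NaN` for any label that is missing from the dictionary, so it can be worth checking for unmapped values after encoding. A minimal sketch on toy data follows; the label `'Unknown'` is a made-up value used only to trigger that case.

```python
import pandas as pd

# Toy labels (assumed); 'Unknown' is deliberately absent from the mapping
labels = pd.Series(['HS-grad', 'Masters', 'Unknown', 'Preschool'])
toy_map = {'Preschool': 1, 'HS-grad': 3, 'Masters': 9}

encoded = labels.map(toy_map)

# Labels that did not receive a rank
unmapped = labels[encoded.isna()].unique().tolist()
print(unmapped)  # ['Unknown']
```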


In [14]:
#Confirm that each category maps to exactly one rank
data.groupby('education')['education_label'].unique()

education
Assoc-acdm       [7]
Assoc-voc        [8]
Bachelors        [5]
Doctorate       [10]
HS-grad          [3]
Masters          [9]
Preschool        [1]
Prof-school      [6]
School           [2]
Some-college     [4]
Name: education_label, dtype: object
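
One way to sanity-check a hand-written ranking is to compare it with the dataset's own `educational-num` column. The sketch below uses four rows copied from the output above and assumes that `educational-num` reflects the intended order of education levels, which the data itself does not state. Under that assumption, `Prof-school` (15) sits above `Assoc-acdm` (12) in the dataset but below it in the manual ranking, so the order chosen here may be worth revisiting.

```python
import pandas as pd

# Four (education, educational-num, education_label) rows taken from the tables above
check = pd.DataFrame({
    'education': ['HS-grad', 'Some-college', 'Assoc-acdm', 'Prof-school'],
    'educational-num': [9, 10, 12, 15],
    'education_label': [3, 4, 7, 6],
})

# If both orderings agree, the manual ranks rise together with educational-num
ordered = check.sort_values('educational-num')
agrees = ordered['education_label'].is_monotonic_increasing
print(agrees)  # False
```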

<b>Conclusion</b>: Label encoding can be performed with pandas functions, with the sklearn library, or manually, as shown in this example.
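
As a sketch of the pandas route mentioned above, an ordered `Categorical` yields integer codes that follow a chosen category order. The order below simply reuses the ranking from this notebook, and the sample values are assumptions. The codes start at 0, so 1 is added to match the manual ranks.

```python
import pandas as pd

# Category order taken from the manual ranking in this notebook
order = ['Preschool', 'School', 'HS-grad', 'Some-college', 'Bachelors',
         'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Masters', 'Doctorate']

# Sample values (assumed for illustration)
sample = pd.Series(['HS-grad', 'Doctorate', 'School', 'Bachelors'])

# An ordered Categorical assigns each value the position of its category
cat = pd.Categorical(sample, categories=order, ordered=True)
ranks = (cat.codes + 1).tolist()  # codes are 0-based
print(ranks)  # [3, 10, 2, 5]
```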

<b>References</b> 
1. https://www.youtube.com/watch?v=fxw_Ak4t-LY&list=PLZoTAELRMXVPwYGE2PXD3x0bfKnR0cJjN&index=6 <br>
2. https://github.com/krishnaik06/Complete-Feature-Engineering