<a href="https://colab.research.google.com/github/Smarth2005/Machine-Learning/blob/main/Exploratory%20Data%20Analysis/Ordinal%20Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Use of sklearn.preprocessing.<span style="color:blue;">OrdinalEncoder</span>**

<div align="justify">

`OrdinalEncoder` converts categorical features into integer codes.

It expects a 2D array-like of shape `(n_samples, n_features)`, such as a DataFrame with one or more columns of text labels or numbers (country names, education levels, etc.), and replaces each unique value in a column with an integer code. Each feature is encoded into a single column whose values range from `0` to `n_categories - 1`, where `n_categories` is the number of unique values in that feature. By default the codes are returned as floats.

This lets models that require numeric input work with categorical data.
</div>
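
To make the description above concrete, here is a minimal sketch on made-up data (the array `X_toy` and its labels are illustrative and not part of the income dataset). It shows the default behaviour, where each column's categories are sorted before being numbered.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data (hypothetical): 4 samples, 2 categorical features.
X_toy = np.array([
    ["low", "red"],
    ["high", "blue"],
    ["medium", "red"],
    ["low", "green"],
], dtype=object)

enc = OrdinalEncoder()            # default: sorted unique values per column
X_enc = enc.fit_transform(X_toy)  # 2D in, 2D out

print(enc.categories_)  # learned categories, one array per feature
print(X_enc)            # float codes in the range 0 .. n_categories - 1
```

Note that the default sorted order gives `high` < `low` < `medium`, which is not a meaningful ranking. That is why an explicit `categories` list is passed later in this notebook.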

In [1]:
import pandas as pd
from google.colab import files
uploaded = files.upload()

Saving income_evaluation.csv to income_evaluation (1).csv


In [2]:
df = pd.read_csv('income_evaluation.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [3]:
df.shape

(32561, 15)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [5]:
df.columns

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' income'],
      dtype='object')

In [6]:
# Split the data before encoding so that the encoder is fit on the training set only.
# This prevents data leakage (test data influencing training) and gives a fair estimate of generalization.
# First, separate independent and dependent features
X = df.drop(' income', axis=1)
y = df[' income']

# train_test_split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=0)
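
The fit-on-train, transform-on-test pattern can be sketched as follows. The small `train` and `test` frames are hypothetical, and the `handle_unknown="use_encoded_value"` option assumes scikit-learn 0.24 or newer; it is one possible way to deal with labels that appear only in the test split, not something this notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy split: "PhD" appears only in the test rows.
train = pd.DataFrame({"level": ["HS", "BSc", "MSc", "BSc"]})
test = pd.DataFrame({"level": ["MSc", "PhD"]})

enc = OrdinalEncoder(
    categories=[["HS", "BSc", "MSc"]],
    handle_unknown="use_encoded_value",  # do not raise on unseen labels
    unknown_value=-1,                    # code assigned to unseen labels
)
enc.fit(train[["level"]])                    # learn the mapping from training data only
test_codes = enc.transform(test[["level"]])  # reuse the mapping; never refit on test data

print(test_codes.ravel())
```

With the default `handle_unknown="error"`, the unseen label would raise an error at transform time instead of being mapped to `-1`.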

In [7]:
# OrdinalEncoder is meant for ordinal data, i.e. categories that have a meaningful order.
x_train[' education'].value_counts()

education,count
HS-grad,8450
Some-college,5832
Bachelors,4242
Masters,1414
Assoc-voc,1110
11th,920
Assoc-acdm,817
10th,752
7th-8th,526
Prof-school,459


By contrast, no order can be imposed on the `workclass` feature, because its categories ('State-gov', 'Private', etc.) are nominal. In nominal data all categories have equal rank: there is no inherent order or hierarchy among them. `OneHotEncoder` is therefore the better choice for such a feature, since it represents the categories without implying any order.
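
As a rough illustration of one-hot encoding a nominal feature, the sketch below uses a tiny made-up `workclass` column (the frame `toy` and its values are assumptions for the example, written without the leading spaces found in the real data).

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical nominal column with three distinct categories.
toy = pd.DataFrame({"workclass": ["Private", "State-gov", "Private", "Self-emp"]})

ohe = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; convert it for display.
onehot = ohe.fit_transform(toy[["workclass"]]).toarray()

print(ohe.categories_[0])  # one output column per category
print(onehot)              # exactly one 1 per row, no implied ranking
```

Each category gets its own 0/1 column, so no category is numerically "larger" than another.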


In [8]:
x_train[' education'].unique()

array([' 11th', ' HS-grad', ' Bachelors', ' Assoc-voc', ' Some-college',
       ' 9th', ' 10th', ' 12th', ' Doctorate', ' Prof-school', ' Masters',
       ' Assoc-acdm', ' 7th-8th', ' 5th-6th', ' Preschool', ' 1st-4th'],
      dtype=object)

In [9]:
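# Education levels listed from lowest to highest; the leading spaces match the raw labels.
# Note: ' Prof-school' is placed right after ' HS-grad' here, whereas the dataset's own
# ' education-num' column ranks it between ' Masters' and ' Doctorate'.
edu = [' Preschool',' 1st-4th', ' 5th-6th', ' 7th-8th',' 9th', ' 10th', ' 11th', ' 12th',
       ' HS-grad', ' Prof-school', ' Some-college', ' Assoc-acdm', ' Assoc-voc',
       ' Bachelors', ' Masters', ' Doctorate']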

In [10]:
from sklearn.preprocessing import OrdinalEncoder

# `categories` takes one list per feature column, so the single order list is wrapped in a list.
ordi = OrdinalEncoder(categories=[edu])

In [11]:
pd.DataFrame(ordi.fit_transform(x_train[[' education']]))

Unnamed: 0,0
0,6.0
1,8.0
2,13.0
3,8.0
4,12.0
...,...
26043,14.0
26044,5.0
26045,10.0
26046,15.0
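
One way to check that the codes follow the intended order is to inspect `categories_` and round-trip the codes with `inverse_transform`. The sketch below uses a shortened, illustrative order list and a four-row sample rather than the full `edu` list; the labels keep their leading spaces because they must match the raw values exactly.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Shortened, illustrative order list (not the full `edu` list from the notebook).
order = [" HS-grad", " Bachelors", " Masters", " Doctorate"]
sample = pd.DataFrame({" education": [" Masters", " HS-grad", " Doctorate", " Bachelors"]})

enc = OrdinalEncoder(categories=[order])
codes = enc.fit_transform(sample[[" education"]])

# Code -> label lookup built from the fitted encoder.
mapping = {i: label for i, label in enumerate(enc.categories_[0])}
print(mapping)

# inverse_transform maps the codes back to the original labels.
recovered = enc.inverse_transform(codes)
print(recovered.ravel())
```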


In [12]:
x_train[' education']

Unnamed: 0,education
15282,11th
24870,HS-grad
18822,Bachelors
26404,HS-grad
7842,Assoc-voc
...,...
13123,Masters
19648,10th
9845,Some-college
10799,Doctorate


In [13]:
x_train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country
15282,36,Private,174308,11th,7,Divorced,Transport-moving,Not-in-family,White,Male,0,0,40,United-States
24870,35,Private,198202,HS-grad,9,Never-married,Exec-managerial,Not-in-family,White,Female,0,0,54,United-States
18822,38,Private,52963,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Female,0,0,50,United-States
26404,50,Private,138270,HS-grad,9,Married-civ-spouse,Sales,Wife,Black,Female,0,0,40,United-States
7842,68,Self-emp-not-inc,116903,Assoc-voc,11,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,2149,40,United-States


In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [15]:
x_train[' marital-status'].unique()

array([' Divorced', ' Never-married', ' Married-civ-spouse', ' Separated',
       ' Widowed', ' Married-spouse-absent', ' Married-AF-spouse'],
      dtype=object)

In [16]:
x_train[' relationship'].unique()

array([' Not-in-family', ' Wife', ' Husband', ' Other-relative',
       ' Own-child', ' Unmarried'], dtype=object)

In [17]:
x_train[' occupation'].unique()

array([' Transport-moving', ' Exec-managerial', ' Adm-clerical', ' Sales',
       ' Prof-specialty', ' Farming-fishing', ' Machine-op-inspct', ' ?',
       ' Other-service', ' Craft-repair', ' Protective-serv',
       ' Tech-support', ' Handlers-cleaners', ' Priv-house-serv',
       ' Armed-Forces'], dtype=object)

Features such as `marital-status`, `occupation`, `relationship`, `race`, `sex`, and `native-country` are nominal: no natural order exists among their categories. One-hot encoding should therefore be used for these features, so that no false order is introduced.
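
A possible way to apply both encoders in one step is scikit-learn's `ColumnTransformer`. This is only a sketch on a hypothetical three-column frame (`toy`, with column names written without leading spaces); the notebook itself encodes `education` directly.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical miniature of the dataset: one ordinal, one nominal, one numeric column.
toy = pd.DataFrame({
    "education": ["HS-grad", "Bachelors", "Masters", "HS-grad"],
    "sex": ["Male", "Female", "Female", "Male"],
    "age": [39, 50, 38, 53],
})

pre = ColumnTransformer(
    transformers=[
        # Ordinal feature: explicit low-to-high order.
        ("ord", OrdinalEncoder(categories=[["HS-grad", "Bachelors", "Masters"]]), ["education"]),
        # Nominal feature: one 0/1 column per category.
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["sex"]),
    ],
    remainder="passthrough",  # keep numeric columns such as age unchanged
    sparse_threshold=0,       # always return a dense array
)

encoded = pre.fit_transform(toy)
print(encoded)  # columns: education code, sex=Female, sex=Male, age
```

In a real workflow the transformer would be fit on `x_train` and then applied to `x_test` with `transform`.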

In [18]:
# Overwrite the column with its integer codes; .ravel() flattens the (n, 1) result to 1D.
x_train[' education'] = ordi.fit_transform(x_train[[' education']]).ravel()

In [19]:
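# Use transform (not fit_transform) on the test set so it reuses the mapping learned from x_train.
# The result is only displayed here; x_test itself is not modified.
pd.Series(ordi.transform(x_test[[' education']]).ravel())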

Unnamed: 0,0
0,10.0
1,13.0
2,11.0
3,2.0
4,6.0
...,...
6508,10.0
6509,6.0
6510,13.0
6511,8.0


`OrdinalEncoder.transform()` always returns a **2D array** of shape `(n_samples, n_features)`, even when only one column is passed.

`pd.Series()` expects **1D** data, so `.ravel()` is used to flatten the `(n_samples, 1)` result first.
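
The shape difference can be seen in a small sketch. The `col` frame, its `size` labels, and its index values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical single-column frame with a non-default index.
col = pd.DataFrame({"size": ["S", "L", "M"]}, index=[10, 20, 30])
enc = OrdinalEncoder(categories=[["S", "M", "L"]])

codes_2d = enc.fit_transform(col[["size"]])
print(codes_2d.shape)   # (3, 1)

codes_1d = codes_2d.ravel()
print(codes_1d.shape)   # (3,)

# Passing the original index keeps the encoded values aligned with their rows.
encoded = pd.Series(codes_1d, index=col.index, name="size")
print(encoded)
```

Without `index=col.index`, the Series gets a fresh `0, 1, 2, ...` index, as in the test-set output shown above.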