Data Encoding in Data Science

Data encoding is the process of converting categorical (text) data into numerical values so that machine learning models can understand and process it.

👉 Machine learning algorithms work with numbers, not text.


Example:

Gender: Male, Female
Encoded as:
Male → 1
Female → 0


Why is Data Encoding Important?

ML models are based on mathematical calculations

Text values cannot be used directly

Encoding converts text into a usable numeric format

Types of Data Encoding


1. Label Encoding

Each category is assigned a unique number.

Example:

City:
Mumbai → 0
Delhi  → 1
Pune   → 2

When to use:

Binary categories (Yes/No, True/False)

Limitation:

The model may assume an order that does not actually exist
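
To make the limitation concrete, here is a minimal sketch using `pd.factorize`, which numbers categories in order of first appearance and so reproduces the Mumbai/Delhi/Pune codes above. The `cities` series and the `label_map` name are illustrative assumptions, not part of the dataset used later.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
cities = pd.Series(["Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"], name="City")

# pd.factorize assigns integer codes in order of first appearance
codes, uniques = pd.factorize(cities)
label_map = {city: code for code, city in enumerate(uniques)}

print(label_map)
print(codes.tolist())

# The numbers carry an accidental ordering: Pune (2) > Delhi (1) > Mumbai (0),
# and Delhi sits exactly "halfway" between Mumbai and Pune.
midpoint = (label_map["Mumbai"] + label_map["Pune"]) / 2
print(midpoint == label_map["Delhi"])
```

A model that treats the column as a quantity would use that ordering and spacing, even though neither means anything for city names.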

2. One-Hot Encoding

Each category is converted into a separate binary column.

Example:

City_Mumbai  City_Delhi  City_Pune
1            0           0
0            1           0
0            0           1




Advantages:

No false ordering

Works well with most ML models

Disadvantage:

Too many categories create too many columns
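
The column growth is easy to see in a small sketch. The data below is hypothetical: a column with 1,000 distinct values, generated only for this illustration.

```python
import pandas as pd

# Hypothetical high-cardinality column: 5,000 rows, 1,000 distinct values
ids = pd.Series([f"city_{i % 1000}" for i in range(5000)], name="City")

one_hot = pd.get_dummies(ids, prefix="City")

# One input column becomes one output column per distinct category
print(ids.nunique())
print(one_hot.shape)
```

Each distinct category adds a column, so the width of the encoded table grows with the number of categories rather than with the number of rows.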



3. Ordinal Encoding

Used when categories have a natural order.

Example:

Education Level:
School → 1
College → 2
Postgraduate → 3

Use case:

Rankings, ratings, levels
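
The mapping above can be sketched with a plain dictionary and `Series.map`. This is one possible approach, not the only one; the `education` series and `level_order` dictionary are assumptions made for this illustration.

```python
import pandas as pd

# Illustrative data (an assumption for this sketch)
education = pd.Series(
    ["College", "School", "Postgraduate", "College"], name="Education"
)

# Explicit order, written out by hand so the ranking is intentional
level_order = {"School": 1, "College": 2, "Postgraduate": 3}

education_encoded = education.map(level_order)
print(education_encoded.tolist())
```

Writing the order out by hand keeps the ranking under your control. Any value missing from the dictionary would become `NaN`, which makes unexpected categories easy to spot.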

4. Frequency (Count) Encoding

Each category is replaced by its frequency in the dataset.

Example:

City → Frequency
Mumbai → 100
Delhi  → 50
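
Frequency encoding is not demonstrated in the cells further down, so here is a minimal sketch with pandas. It assumes a five-row `City` column like the one in the example dataset; the `City_freq` column name is an assumption chosen for this illustration.

```python
import pandas as pd

# Small illustrative column (an assumption for this sketch)
df_freq = pd.DataFrame({"City": ["Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"]})

# Count how often each category occurs, then map the counts back onto the rows
city_counts = df_freq["City"].value_counts()
df_freq["City_freq"] = df_freq["City"].map(city_counts)

print(df_freq)
```

Note that Mumbai and Delhi both occur twice here, so they receive the same code: frequency encoding cannot tell apart categories that happen to be equally common. Passing `normalize=True` to `value_counts` gives proportions instead of raw counts.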


Data encoding is the process of transforming categorical variables into numerical representations so that machine learning algorithms can process them effectively.

In [1]:

# Step 1: Create a small example dataset
import pandas as pd

df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
    "City": ["Mumbai", "Delhi", "Pune", "Mumbai", "Delhi"],
    "Education": ["Graduate", "Postgraduate", "Graduate", "PhD", "Postgraduate"],
    "Salary": [30000, 45000, 32000, 70000, 48000]
})

df


index,Gender,City,Education,Salary
0,Male,Mumbai,Graduate,30000
1,Female,Delhi,Postgraduate,45000
2,Female,Pune,Graduate,32000
3,Male,Mumbai,PhD,70000
4,Female,Delhi,Postgraduate,48000


In [2]:
# Label encoding (binary column)

from sklearn.preprocessing import LabelEncoder

# LabelEncoder sorts the classes alphabetically: Female -> 0, Male -> 1
le = LabelEncoder()
df["Gender_encoded"] = le.fit_transform(df["Gender"])

df[["Gender", "Gender_encoded"]]


index,Gender,Gender_encoded
0,Male,1
1,Female,0
2,Female,0
3,Male,1
4,Female,0


In [3]:
# One-hot encoding: one boolean column per city, in alphabetical order
city_encoded = pd.get_dummies(df["City"], prefix="City")
city_encoded


index,City_Delhi,City_Mumbai,City_Pune
0,False,True,False
1,True,False,False
2,False,False,True
3,False,True,False
4,True,False,False


In [4]:
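# Ordinal encoding

from sklearn.preprocessing import OrdinalEncoder

# List the categories from lowest to highest; codes start at 0
edu_order = [["Graduate", "Postgraduate", "PhD"]]

# OrdinalEncoder expects 2D input (hence the double brackets) and returns floats
oe = OrdinalEncoder(categories=edu_order)
df["Education_encoded"] = oe.fit_transform(df[["Education"]])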

df[["Education", "Education_encoded"]]


index,Education,Education_encoded
0,Graduate,0.0
1,Postgraduate,1.0
2,Graduate,0.0
3,PhD,2.0
4,Postgraduate,1.0


In [5]:
# Combine the encoded columns and Salary into the final feature table
final_df = pd.concat(
    [df[["Gender_encoded", "Education_encoded", "Salary"]], city_encoded],
    axis=1
)

final_df


index,Gender_encoded,Education_encoded,Salary,City_Delhi,City_Mumbai,City_Pune
0,1,0.0,30000,False,True,False
1,0,1.0,45000,True,False,False
2,0,0.0,32000,False,False,True
3,1,2.0,70000,False,True,False
4,0,1.0,48000,True,False,False
