Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which is the form most machine learning algorithms require. Each category is represented as a binary vector with one position per unique category: the position for that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

Red: [1, 0, 0]
Green: [0, 1, 0]
Blue: [0, 0, 1]

What it does:
Turns each category into a separate column with 0 or 1 (binary values).
Avoids the misleading numeric order that integer codes (as in label encoding) would imply.

["Red", "Green", "Blue"]→
Red   Green  Blue
[1,     0,     0]
[0,     1,     0]
[0,     0,     1]

Use when:
Categories are non-ordinal (no order).
You want to avoid implying any relationship between the categories.
Drawback:
Can lead to many columns with high-cardinality data (many unique categories).

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [6]:
df.head()

   color
0    red
1   blue
2  green
3  green
4    red


In [8]:
## create an instance of OneHotEncoder
encoder=OneHotEncoder()

In [10]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [12]:
## wrap the encoded array in a DataFrame with one named column per category
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [14]:
encoder_df

   color_blue  color_green  color_red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          1.0        0.0
4         0.0          0.0        1.0
5         1.0          0.0        0.0


In [16]:
## for new data (a DataFrame with the same column name as during fit)
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()



array([[1., 0., 0.]])
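
New data can also contain a category the encoder never saw during fit, and the default encoder raises an error in that case. The sketch below shows one way to handle it with the handle_unknown='ignore' option of OneHotEncoder. The names safe_encoder, train, and new_data and the 'purple' value are illustrative assumptions, not part of the notebook above.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same toy data as in the cells above.
train = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

# handle_unknown='ignore' encodes an unseen category as an all-zero row
# instead of raising an error at transform time.
safe_encoder = OneHotEncoder(handle_unknown='ignore')
safe_encoder.fit(train[['color']])

# 'purple' is a hypothetical category that was not present during fit.
new_data = pd.DataFrame({'color': ['blue', 'purple']})
encoded_new = safe_encoder.transform(new_data).toarray()
print(encoded_new)
```

The known category 'blue' gets its usual one-hot row, while the unseen 'purple' row contains only zeros. Whether silently zeroing unknown values is acceptable depends on the model and the data.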

In [18]:
pd.concat([df,encoder_df],axis=1)

   color  color_blue  color_green  color_red
0    red         0.0          0.0        1.0
1   blue         1.0          0.0        0.0
2  green         0.0          1.0        0.0
3  green         0.0          1.0        0.0
4    red         0.0          0.0        1.0
5   blue         1.0          0.0        0.0


In [20]:
import seaborn as sns
sns.load_dataset('tips')

     total_bill   tip     sex smoker  day    time  size
0         16.99  1.01  Female     No  Sun  Dinner     2
1         10.34  1.66    Male     No  Sun  Dinner     3
2         21.01  3.50    Male     No  Sun  Dinner     3
3         23.68  3.31    Male     No  Sun  Dinner     2
4         24.59  3.61  Female     No  Sun  Dinner     4
...         ...   ...     ...    ...  ...     ...   ...
239       29.03  5.92    Male     No  Sat  Dinner     3
240       27.18  2.00  Female    Yes  Sat  Dinner     2
241       22.67  2.00    Male    Yes  Sat  Dinner     2
242       17.82  1.75    Male     No  Sat  Dinner     2


A sparse matrix is a matrix (or table) that contains mostly zeros.

With nominal (one-hot) encoding, each category becomes a new column. If the variable has many categories, most values in each row are 0, so the encoded matrix is sparse.
Each row holds exactly one 1 for the encoded variable; the rest are zeros. Stored as a dense array, this wastes memory and can slow down computation, whereas an optimized format (like scipy.sparse) stores only the non-zero entries.
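
To make the memory argument concrete, the sketch below compares how many values a sparse one-hot matrix actually stores with how many cells its dense version holds. The data is randomly generated for illustration only: the 200 labels named cat_0 ... cat_199, the 1,000 rows, and the variable names are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 1,000 rows drawn from 200 made-up category labels.
rng = np.random.default_rng(0)
labels = [f"cat_{i}" for i in range(200)]
frame = pd.DataFrame({'category': rng.choice(labels, size=1000)})

# By default OneHotEncoder returns a SciPy sparse matrix.
sparse_encoded = OneHotEncoder().fit_transform(frame[['category']])
dense_encoded = sparse_encoded.toarray()

n_rows, n_cols = sparse_encoded.shape
print("shape:", (n_rows, n_cols))
print("stored non-zero values (sparse):", sparse_encoded.nnz)  # one per row
print("cells held in memory (dense):", dense_encoded.size)     # n_rows * n_cols
```

The sparse matrix stores one value per row, while the dense array holds a cell for every row and column combination, almost all of them zero.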
Drawback of Nominal (One-Hot) Encoding: Overfitting
Why overfitting happens:
Too many features (columns) are created, especially when the variable has many unique values (high cardinality).

Each column is then backed by only a few rows, so the model may start to memorize specific categories or category combinations instead of learning general patterns.

This reduces performance on unseen/test data.
Example:
Say you have a feature called City with 1,000 unique cities.

One-Hot Encoding will create 1,000 new columns like:
City_NewYork | City_London | City_Tokyo | ...
1 | 0 | 0 | ...
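If the dataset is small relative to that, say 2,000 rows, each city column is backed by only about two rows on average. (With fewer rows than categories, for example 500 rows encoded against a fixed list of 1,000 cities, you would even have more features than data points.) Either situation is dangerous: the model can start "memorizing" individual cities rather than learning.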

How to Handle These Issues:
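Keep the encoder output sparse to save memory. This is the default in OneHotEncoder (the parameter is sparse_output in scikit-learn 1.2 and later, sparse in older versions), so avoid calling .toarray() on large data.
Consider dimensionality reduction (e.g., PCA, or TruncatedSVD, which accepts sparse input directly).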
Try Target Encoding or Frequency Encoding for high-cardinality features.
Regularize your model to avoid overfitting.
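
As an illustration of the frequency encoding mentioned in the list above, the following minimal sketch replaces each category with the share of rows in which it appears. The cities frame and its values are made up for the example, and the column name city_freq is an assumption.

```python
import pandas as pd

# Hypothetical toy data standing in for a high-cardinality City feature.
cities = pd.DataFrame({'city': ['London', 'Tokyo', 'London', 'Paris', 'London', 'Tokyo']})

# Share of rows in which each city appears.
city_freq = cities['city'].value_counts(normalize=True)

# Replace each city by its frequency: a single numeric column,
# no matter how many unique cities exist.
cities['city_freq'] = cities['city'].map(city_freq)
print(cities)
```

In a real pipeline the frequencies would normally be computed on the training split only and then mapped onto validation and test data. Categories that share the same frequency also become indistinguishable, which is the price of the compact representation.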

Label Encoding
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order (this is what scikit-learn's LabelEncoder does, starting from 0) or based on the frequency of the categories. For example, a categorical variable "color" with three possible values (red, green, blue) could be label-encoded as follows:

Red: 1
Green: 2
Blue: 3

What it does:
Converts each category into a unique number (integer).
Suits ordered data when the integers follow the real order, but is easily misused on nominal data, where the integers imply an order that does not exist.

Example:
["Low", "Medium", "High"] → [0, 1, 2]  # meaningful only if the integers follow the real order

If applied to:
["Red", "Green", "Blue"] → [0, 1, 2]  # the order carries no meaning here

Risk: ML models may interpret 2 > 1 > 0 as meaningful when it isn't.

Use when:
Categories have a natural order (e.g. "cold", "warm", "hot").
You're using tree-based models, which split on thresholds and so tend to cope with label-encoded nominal data better than linear or distance-based models.

In [22]:
df.head()

   color
0    red
1   blue
2  green
3  green
4    red


In [24]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [26]:
lbl_encoder.fit_transform(df['color'])  # LabelEncoder expects a 1-D input

array([2, 0, 1, 1, 2, 0])

In [28]:
lbl_encoder.transform(['red'])

array([2])

In [30]:
lbl_encoder.transform(['blue'])

array([0])

In [32]:
lbl_encoder.transform(['green'])

array([1])
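
Rather than probing the encoder one value at a time, the learned mapping can be read from its classes_ attribute, and inverse_transform turns integers back into labels. The sketch below is self-contained and uses the same colors as above; the variable names le, codes, mapping, and decoded are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the labels in sorted order; the position of a label is its code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```

The mapping confirms that the integers come from alphabetical order (blue, green, red), not from any property of the colors themselves.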

Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal-encoded as follows:

High school: 1
College: 2
Graduate: 3
Post-graduate: 4
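
The education-level mapping above can be written directly as a dictionary and applied with pandas. This is only a sketch of the idea: the people frame is made-up data, and the names education_order and education_encoded are assumptions.

```python
import pandas as pd

# Explicit rank for each level, matching the list above (1 = lowest).
education_order = {
    'high school': 1,
    'college': 2,
    'graduate': 3,
    'post-graduate': 4,
}

# Hypothetical sample data.
people = pd.DataFrame({
    'education': ['college', 'high school', 'post-graduate', 'graduate']
})

# map() looks each value up in the dictionary; values missing from it become NaN.
people['education_encoded'] = people['education'].map(education_order)
print(people)
```

A plain dictionary keeps the order fully explicit and lets the codes start at 1, whereas scikit-learn's OrdinalEncoder numbers the categories from 0.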

In [34]:
#Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [36]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [38]:
df

     size
0   small
1  medium
2   large
3  medium
4   small
5   large


In [40]:
## create an instance of OrdinalEncoder with the category order given explicitly
encoder=OrdinalEncoder(categories=[['small','medium','large']])

In [42]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [44]:
encoder.transform(pd.DataFrame({'size': ['small']}))



array([[0.]])

Label Encoding:
It converts categorical labels into integers, for example:

["Red", "Green", "Blue"] → [0, 1, 2]
The labels are simply replaced with integers.
No order is assumed; which integer goes with which label is arbitrary (scikit-learn's LabelEncoder uses the sorted order of the labels).
Example: LabelEncoder from scikit-learn.
Use case: Good for encoding the target variable, or for categories with no meaningful order that are fed to tree-based models.

Ordinal Encoding:
Also converts categories to integers, but assumes there is a meaningful order between categories.

["Low", "Medium", "High"] → [0, 1, 2]
The order matters here, and the numbers imply ranking.
Use case: Best when your categorical data has a natural order, like "small, medium, large" or "beginner, intermediate, advanced".
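
The difference matters in practice because LabelEncoder sorts labels alphabetically, which rarely matches the real ranking. The sketch below encodes the same ordered levels with both encoders; the levels frame is made-up data and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordered data: Low < Medium < High.
levels = pd.DataFrame({'level': ['Low', 'Medium', 'High', 'Medium']})

# LabelEncoder sorts alphabetically: High=0, Low=1, Medium=2.
label_codes = LabelEncoder().fit_transform(levels['level'])

# OrdinalEncoder with an explicit category list keeps the real order:
# Low=0, Medium=1, High=2.
ordinal = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
ordinal_codes = ordinal.fit_transform(levels[['level']]).ravel()

print(label_codes)
print(ordinal_codes)
```

With LabelEncoder, "High" ends up below "Low", so the integers contradict the ranking. Passing the order explicitly to OrdinalEncoder, as in the size example above, avoids this.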