<a href="https://colab.research.google.com/github/Smarth2005/Machine-Learning/blob/main/Exploratory%20Data%20Analysis/Label%20Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Use of sklearn.preprocessing.<span style="color:blue;">LabelEncoder</span>**

<div align="justify">

`LabelEncoder` is a scikit-learn utility that converts **categorical labels into integer values** by mapping each distinct **class label** (for example, a string) to a unique integer. This is useful when a machine learning model requires numerical labels, as in classification tasks.

It encodes target labels with values ranging from `0` to `n_classes - 1`.

<span style="color:red;">This transformer should be used to encode target values, i.e., **$y$**, and not the input features **$X$**.</span></div>
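
As a minimal sketch of the mapping described above, the toy labels below (invented for illustration, not taken from the dataset) are encoded to integers from `0` to `n_classes - 1`. The integer assigned to each label is its position in the sorted `classes_` array.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, invented for illustration (not from the dataset)
labels = ["dog", "cat", "dog", "bird"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's index is its integer code
print(encoder.classes_)  # ['bird' 'cat' 'dog']
print(encoded)           # [2 1 2 0]
```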

In [1]:
import pandas as pd
import numpy  as np

In [2]:
from google.colab import files
uploaded = files.upload()

Saving income_evaluation.csv to income_evaluation (1).csv


In [3]:
df = pd.read_csv('income_evaluation.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.shape

(32561, 15)

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              32561 non-null  int64 
 1    workclass       32561 non-null  object
 2    fnlwgt          32561 non-null  int64 
 3    education       32561 non-null  object
 4    education-num   32561 non-null  int64 
 5    marital-status  32561 non-null  object
 6    occupation      32561 non-null  object
 7    relationship    32561 non-null  object
 8    race            32561 non-null  object
 9    sex             32561 non-null  object
 10   capital-gain    32561 non-null  int64 
 11   capital-loss    32561 non-null  int64 
 12   hours-per-week  32561 non-null  int64 
 13   native-country  32561 non-null  object
 14   income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
df.columns

Index(['age', ' workclass', ' fnlwgt', ' education', ' education-num',
       ' marital-status', ' occupation', ' relationship', ' race', ' sex',
       ' capital-gain', ' capital-loss', ' hours-per-week', ' native-country',
       ' income'],
      dtype='object')

In [7]:
df[' income'].unique()

array([' <=50K', ' >50K'], dtype=object)

Split the dataset into training and testing subsets **before fitting** `LabelEncoder`. Fitting any transformer on the full dataset risks **data leakage**, where information from the test set unintentionally influences the training process and inflates the estimate of the model's generalization ability.

In [8]:
# first separate independent and dependent features
X = df.drop(' income', axis=1)
y = df[' income']

In [9]:
# train_test_split
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)


<div align="justify">

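**Why does this order matter?**

>If you fit `LabelEncoder` on the entire target variable $y$ before splitting, the encoder learns its mapping from labels it should not yet have access to, namely the labels in the test set. `LabelEncoder` only stores the sorted set of class names, so the practical effect here is small, but fitting on the training labels alone keeps the procedure consistent with how every other fitted transformer must be handled.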

**Now, Initialize and Fit the LabelEncoder on Training Labels**

At this stage, you initialize an instance of `LabelEncoder` and **fit it only on the training labels** (`y_train`). This allows the encoder to learn the mapping between the original labels and their corresponding integer representations based **solely on the training data.**
</div>

In [10]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(y_train)

In [11]:
le.classes_

array([' <=50K', ' >50K'], dtype=object)

In [12]:
le.transform(y_train)

array([0, 0, 0, ..., 0, 1, 0])

In [13]:
pd.Series(le.transform(y_train)).head(10)

Unnamed: 0,0
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,0


In [14]:
y_train.head(10)

Unnamed: 0,income
15282,<=50K
24870,<=50K
18822,<=50K
26404,<=50K
7842,<=50K
4890,<=50K
3243,<=50K
17470,<=50K
14211,<=50K
22453,<=50K



`le.fit_transform(y_train)` is a **shorthand method** that fits the `LabelEncoder` on the training labels and transforms them into integers in a single call. It is a common way to encode training labels, and the fitted encoder remains available afterwards for transforming the test labels.
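
The sketch below checks, on a toy stand-in for `y_train` (values invented for illustration), that `fit_transform` gives the same result as calling `fit` and then `transform`. It also shows `inverse_transform`, which maps the integers back to the original labels; the names `y_toy` and `le_toy` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for y_train (values invented for illustration)
y_toy = np.array([" <=50K", " >50K", " <=50K", " <=50K"])

# Two-step version: fit, then transform
two_step = LabelEncoder().fit(y_toy).transform(y_toy)

# One-step shorthand
le_toy = LabelEncoder()
one_step = le_toy.fit_transform(y_toy)

print(one_step)                            # [0 1 0 0]
print(np.array_equal(two_step, one_step))  # True

# inverse_transform recovers the original string labels
recovered = le_toy.inverse_transform(one_step)
print(recovered)
```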



**Next Step: Transform the Test Labels Using the Fitted Encoder**

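Now that the encoder has been fitted on the training set, we can transform the test labels with the same mapping by calling `transform()`, without refitting.
Calling `fit_transform()` on the test labels would relearn the mapping from the test set alone. If the test set contains a different set of classes, the same label could receive a different integer than it did during training.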
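
To make the inconsistency concrete, here is a small sketch with invented labels in which the test split happens to contain only one class. Reusing the fitted encoder keeps the training mapping, while refitting on the test labels assigns a different integer to the same label. The last part shows that `transform` raises a `ValueError` for a label that was not seen during fitting.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels (invented): the "test" split contains only one class
y_train_toy = [" <=50K", " >50K", " <=50K"]
y_test_toy = [" >50K", " >50K"]

fitted = LabelEncoder().fit(y_train_toy)
consistent = fitted.transform(y_test_toy)         # reuses the training mapping
refit = LabelEncoder().fit_transform(y_test_toy)  # learns a new mapping from the test labels

print(consistent)  # [1 1]
print(refit)       # [0 0]

# A label that was absent during fit cannot be transformed
unseen_rejected = False
try:
    fitted.transform([" unknown"])
except ValueError:
    unseen_rejected = True
print(unseen_rejected)  # True
```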

In [15]:
le.transform(y_test)

array([0, 0, 0, ..., 1, 0, 1])

In [16]:
y_test_encoded = le.transform(y_test)

#### **Why is `LabelEncoder` not suitable for encoding categorical features?**

In [17]:
le1 = LabelEncoder()
le1.fit_transform(x_train[' native-country'])

array([39, 39, 39, ..., 39, 39, 39])

In [18]:
x_train[' native-country'].head(10)

Unnamed: 0,native-country
15282,United-States
24870,United-States
18822,United-States
26404,United-States
7842,United-States
4890,United-States
3243,Mexico
17470,United-States
14211,United-States
22453,United-States


In [19]:
x_train[' native-country_encoded'] = le1.fit_transform(x_train[' native-country'])

In [20]:
x_train[' native-country_encoded'].head(10)

Unnamed: 0,native-country_encoded
15282,39
24870,39
18822,39
26404,39
7842,39
4890,39
3243,26
17470,39
14211,39
22453,39


In [21]:
print(list(enumerate(le1.classes_)))

[(0, ' ?'), (1, ' Cambodia'), (2, ' Canada'), (3, ' China'), (4, ' Columbia'), (5, ' Cuba'), (6, ' Dominican-Republic'), (7, ' Ecuador'), (8, ' El-Salvador'), (9, ' England'), (10, ' France'), (11, ' Germany'), (12, ' Greece'), (13, ' Guatemala'), (14, ' Haiti'), (15, ' Holand-Netherlands'), (16, ' Honduras'), (17, ' Hong'), (18, ' Hungary'), (19, ' India'), (20, ' Iran'), (21, ' Ireland'), (22, ' Italy'), (23, ' Jamaica'), (24, ' Japan'), (25, ' Laos'), (26, ' Mexico'), (27, ' Nicaragua'), (28, ' Outlying-US(Guam-USVI-etc)'), (29, ' Peru'), (30, ' Philippines'), (31, ' Poland'), (32, ' Portugal'), (33, ' Puerto-Rico'), (34, ' Scotland'), (35, ' South'), (36, ' Taiwan'), (37, ' Thailand'), (38, ' Trinadad&Tobago'), (39, ' United-States'), (40, ' Vietnam'), (41, ' Yugoslavia')]


This integer encoding invites the model to assume that:

- the numeric gap between Mexico and the United States, (39 - 26) = 13, is meaningful;
- larger codes are more significant than smaller ones.

Neither assumption has any logical basis: countries are nominal categories, not ordered levels.
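
As a rough illustration of the spurious ordering, the sketch below encodes three country names (a small invented subset, not the full column) and prints the pairwise gaps between their integer codes. A distance-based model would treat these gaps as real differences between countries.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Small invented subset of country names, for illustration only
countries = np.array([" Cuba", " Mexico", " United-States"])
codes = LabelEncoder().fit_transform(countries)
print(dict(zip(countries.tolist(), codes.tolist())))
# {' Cuba': 0, ' Mexico': 1, ' United-States': 2}

# Pairwise absolute differences between the integer codes
gaps = np.abs(codes[:, None] - codes[None, :])
print(gaps)
# [[0 1 2]
#  [1 0 1]
#  [2 1 0]]
```

Here Cuba appears twice as far from the United States as from Mexico, purely because of alphabetical order.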


#### ❗ **The Problem:**

<div align="justify">

`LabelEncoder` assigns **an integer to each unique category** according to the sorted (here alphabetical) order of the values, an ordering that is arbitrary with respect to what the categories mean. While this works fine for target variables (y), applying it to input features (X) can be misleading and harmful, especially for models that interpret numerical values as having order or magnitude.

Models such as Logistic Regression, Linear Regression, KNN, and SVM treat these integers as quantities with numeric meaning, and tree-based models split on them as if they were ordered. This can lead to biased, inaccurate, or unstable models.

#### **Correct Alternatives:**

Depending on the type of categorical feature, you should use:

1. `OneHotEncoder` (for nominal/unordered categories)
2. `OrdinalEncoder` (for ordered categories)

</div>
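
The two alternatives can be sketched as follows. The data is invented for illustration, and the `sizes` feature with its `low < medium < high` order is a hypothetical example rather than a column of the dataset. Both encoders expect 2D input, and `handle_unknown="ignore"` is one possible choice for categories that appear only at prediction time.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy nominal feature (invented); "India" is absent from the training data
countries_train = np.array([["Mexico"], ["United-States"], ["Cuba"]])
countries_test = np.array([["Cuba"], ["India"]])

# handle_unknown="ignore" encodes unseen categories as an all-zero row
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(countries_train)
onehot_test = ohe.transform(countries_test).toarray()  # default output is sparse
print(ohe.categories_[0])  # ['Cuba' 'Mexico' 'United-States']
print(onehot_test)
# [[1. 0. 0.]
#  [0. 0. 0.]]

# Hypothetical ordered feature: pass the intended order explicitly
sizes = np.array([["medium"], ["low"], ["high"]])
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordinal_codes = ordinal.fit_transform(sizes).ravel()
print(ordinal_codes)  # [1. 0. 2.]
```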

#### **Key Takeaway:**

>`LabelEncoder` is designed for encoding target variables, not categorical features. For input features, always consider whether the data is nominal or ordinal, and use the appropriate encoder.