<a href="https://colab.research.google.com/github/Smarth2005/Machine-Learning/blob/main/Exploratory%20Data%20Analysis/04.%20OneHot%20Encoder.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### **Use of sklearn.preprocessing.<span style="color:blue;">OneHotEncoder</span>**

<div align="justify">

`OneHotEncoder`, a transformer from scikit-learn’s preprocessing module, converts categorical features into a numeric format that ML algorithms can consume. It creates one binary column for each category within a feature, so the model does not assume any ordinal relationship between the values.

This encoder is especially important for models like **linear regression, logistic regression, and support vector machines**, where treating categories as numbers would introduce misleading assumptions. The result is either a sparse matrix or a dense array, depending on configuration, representing the presence (`1`) or absence (`0`) of a category.

Whether the categories are stored as strings or integers, `OneHotEncoder` gives each category its own column, preserving the qualitative nature of the feature.

>**Note:** For encoding target labels (i.e., the `y` vector), consider using `LabelBinarizer` instead of `OneHotEncoder`.

</div>
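To make the note about target labels concrete, here is a minimal sketch of `LabelBinarizer` applied to a target vector. The class names in `y_demo` are hypothetical and serve only as an illustration.

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical three-class target vector, used only for illustration.
y_demo = ["setosa", "virginica", "versicolor", "setosa"]

lb = LabelBinarizer()
y_encoded = lb.fit_transform(y_demo)

print(lb.classes_)  # classes are stored in sorted order
print(y_encoded)    # one row per sample, one column per class
```

Unlike `OneHotEncoder`, which expects a 2D feature matrix, `LabelBinarizer` accepts the 1D label vector directly.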



In [1]:
import pandas as pd

In [2]:
flowers = pd.DataFrame({
    'color' : ['red','green','red','green','red','green','red','green','blue','blue'],
    'height': [4,9,4,8,4,7,4,7.5,20,19],
    'petals': [3,9,1,8,1,10,2,8,50,47],
    'days'  : [6,16,7,15,8,17,5,12,40,45]
})

In [3]:
flowers

Unnamed: 0,color,height,petals,days
0,red,4.0,3,6
1,green,9.0,9,16
2,red,4.0,1,7
3,green,8.0,8,15
4,red,4.0,1,8
5,green,7.0,10,17
6,red,4.0,2,5
7,green,7.5,8,12
8,blue,20.0,50,40
9,blue,19.0,47,45


In [4]:
# separate independent and dependent features
X = flowers.drop('days', axis=1)
y = flowers['days']

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=40)

In [6]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False) # sparse_output=False means we want a dense matrix


**Disclaimer:**  
The **parameter names** in `OneHotEncoder` may vary depending on the version of scikit-learn installed.

- In **scikit-learn 1.2**, the `sparse` parameter was renamed to `sparse_output`; the old name was deprecated and removed in a later release (1.4).  
- The method `get_feature_names()` was deprecated in 1.0 and removed in 1.2; use `get_feature_names_out()` in newer versions.

📌 Please refer to your current scikit-learn version and its documentation to ensure compatibility.
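
One way to write code that tolerates both parameter names is to inspect the constructor signature at runtime. The sketch below assumes nothing beyond the rename described above; `make_dense_encoder` is a hypothetical helper name, not part of scikit-learn.

```python
import inspect

from sklearn.preprocessing import OneHotEncoder

# Check which keyword this installation's OneHotEncoder accepts.
params = inspect.signature(OneHotEncoder.__init__).parameters
dense_kwarg = "sparse_output" if "sparse_output" in params else "sparse"


def make_dense_encoder(**kwargs):
    """Hypothetical helper: build an encoder that returns a dense array."""
    return OneHotEncoder(**{dense_kwarg: False}, **kwargs)


ohe_compat = make_dense_encoder(handle_unknown="ignore")
encoded = ohe_compat.fit_transform([["red"], ["green"], ["red"]])

# Fall back to the old method name only if the new one is missing.
get_names = getattr(ohe_compat, "get_feature_names_out", None) or ohe_compat.get_feature_names
print(dense_kwarg, encoded.shape, get_names())
```

Pinning the scikit-learn version in your environment is usually simpler; this pattern is mainly useful for notebooks shared across installations.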


In [7]:
x_train

Unnamed: 0,color,height,petals
8,blue,20.0,50
1,green,9.0,9
2,red,4.0,1
9,blue,19.0,47
0,red,4.0,3
5,green,7.0,10
7,green,7.5,8
6,red,4.0,2


In [8]:
x_train['color'] # pd.Series

Unnamed: 0,color
8,blue
1,green
2,red
9,blue
0,red
5,green
7,green
6,red


In [9]:
x_train[['color']] # pd.DataFrame

Unnamed: 0,color
8,blue
1,green
2,red
9,blue
0,red
5,green
7,green
6,red


In [10]:
ohe.fit_transform(x_train[['color']])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [11]:
pd.get_dummies(x_train['color']).astype(int)

Unnamed: 0,blue,green,red
8,1,0,0
1,0,1,0
2,0,0,1
9,1,0,0
0,0,0,1
5,0,1,0
7,0,1,0
6,0,0,1


In [12]:
x_test

Unnamed: 0,color,height,petals
4,red,4.0,1
3,green,8.0,8


In [13]:
pd.get_dummies(x_test)

Unnamed: 0,height,petals,color_green,color_red
4,4.0,1,False,True
3,8.0,8,True,False


In [14]:
ohe.transform(x_test[['color']])

array([[0., 0., 1.],
       [0., 1., 0.]])

<div align="justify">

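**Advantage #1: `OneHotEncoder` preserves feature structure:** `pd.get_dummies()` only creates columns for the categories present in the data it is given, so `x_test` (which contains no `blue` rows) yields two color columns instead of three. A fitted `OneHotEncoder` transforms `x_test` using the same feature structure it learned from `x_train`, eliminating the need for manual alignment.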

</div>

In [15]:
flowers_new = pd.DataFrame({
    'color' : ['red','green','red','pink','red','green','red','green','blue','blue'],
    'height': [4,9,4,8,4,7,4,7.5,20,19],
    'petals': [3,9,1,8,1,10,2,8,50,47],
    'days'  : [6,16,7,15,8,17,5,12,40,45]
})
flowers_new

Unnamed: 0,color,height,petals,days
0,red,4.0,3,6
1,green,9.0,9,16
2,red,4.0,1,7
3,pink,8.0,8,15
4,red,4.0,1,8
5,green,7.0,10,17
6,red,4.0,2,5
7,green,7.5,8,12
8,blue,20.0,50,40
9,blue,19.0,47,45


In [16]:
# Again we separate independent and dependent features for the new DataFrame
X1 = flowers_new.drop('days', axis=1)
y1 = flowers_new['days']

# Data Splitting
x1_train, x1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.2, random_state=40)

In [17]:
x1_train[['color']]

Unnamed: 0,color
8,blue
1,green
2,red
9,blue
0,red
5,green
7,green
6,red


In [18]:
x1_test[['color']]

Unnamed: 0,color
4,red
3,pink


In [19]:
enc = OneHotEncoder(sparse_output=False, handle_unknown="error")

In [20]:
enc.fit_transform(x1_train[['color']])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [21]:
enc.transform(x1_test[['color']])

ValueError: Found unknown categories ['pink'] in column 0 during transform

In [22]:
enc1 = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
enc1.fit_transform(x1_train[['color']])

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

In [23]:
enc1.transform(x1_test[['color']])

array([[0., 0., 1.],
       [0., 0., 0.]])

<div align="justify">

**Advantage #2: `OneHotEncoder` handles category mismatches:**
`OneHotEncoder` handles category mismatches between training and test sets more gracefully than `pd.get_dummies()`. With the default `handle_unknown='error'` it raises a clear `ValueError` for unseen categories; when configured to **ignore unknown categories** (`handle_unknown='ignore'`), it encodes them as an all-zero row, as for `pink` above, preventing shape misalignment and errors.

</div>

After experimenting with sample data, we now return to the real-world dataset we've been working on from the beginning.

In [24]:
from google.colab import files
uploaded = files.upload()

Saving income_evaluation.csv to income_evaluation.csv


In [25]:
df = pd.read_csv('income_evaluation.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [26]:
# separate independent and dependent features
X2 = df.drop(' income', axis=1)
y2 = df[' income']

# train_test_split
x2_train, x2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.2, random_state=40)

In [27]:
# create OneHotEncoder object
obj = OneHotEncoder()

# sparse_output=True is the default, so this encoder returns a sparse matrix (which we'll explore next)

<div align="justify">

As discussed earlier, this dataset contains some nominal features whose categories are all equivalent in rank, meaning there is no inherent order or hierarchy among them. One-hot encoding is therefore used for these features to avoid introducing a false order.

For instance, no order can be imposed on the `workclass` feature because its categories, such as 'State-gov' and 'Private', are nominal. Similarly, `marital-status`, `occupation`, `relationship`, `race`, `sex`, and `native-country` are nominal features with no natural order among their categories. Hence, we can use `OneHotEncoder` to encode these categorical features.

</div>

In [28]:
obj.fit_transform(x2_train[[' workclass']])

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 26048 stored elements and shape (26048, 9)>

The command returns a sparse matrix because `sparse_output=True` by default. Calling `.toarray()` on it converts it to a dense NumPy array.

In [29]:
obj.fit_transform(x2_train[[' workclass']]).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [30]:
pd.DataFrame(obj.fit_transform(x2_train[[' workclass']]).toarray())

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...
26043,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
26044,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
26045,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26046,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [31]:
x2_train[' workclass'].unique()

array([' Private', ' ?', ' Local-gov', ' State-gov', ' Self-emp-not-inc',
       ' Self-emp-inc', ' Federal-gov', ' Never-worked', ' Without-pay'],
      dtype=object)

We have 9 unique categories in this feature (including the `' ?'` placeholder), hence we get 9 columns in the DataFrame. In one-hot encoding, exactly one column per row is 'hot' (1) and all others are 'cold' (0), so no ordinal relationship between categories is implied.
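
The "exactly one hot column per row" property can be checked directly, and it also explains why the sparse output earlier reported one stored element per row. The sketch below uses a hypothetical stand-in column rather than the real `workclass` data.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for a single categorical column.
workclass_demo = np.array(
    [["Private"], ["State-gov"], ["Private"], ["Local-gov"], ["?"]], dtype=object
)

enc = OneHotEncoder()  # default output is a SciPy sparse matrix
encoded_sparse = enc.fit_transform(workclass_demo)

n_rows, n_cols = encoded_sparse.shape
n_categories = len(enc.categories_[0])

# Each row contains a single 1, so every row sums to 1.
row_sums = np.asarray(encoded_sparse.sum(axis=1)).ravel()

print(n_cols, n_categories, encoded_sparse.nnz, row_sums)
```

Because only the ones are stored, the sparse representation grows with the number of rows rather than with rows times categories.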

In [33]:
obj.transform(x2_test[[' workclass']]).toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

We can pass multiple categorical features to the `OneHotEncoder` object, allowing it to encode all of them in a single step. Let's explore.

In [35]:
x2_train[[' workclass']].nunique()

Unnamed: 0,0
workclass,9


In [36]:
x2_train[[' occupation']].nunique()

Unnamed: 0,0
occupation,15


In [37]:
# We create another OneHot Encoder object for experimentation.
obj2 = OneHotEncoder(sparse_output=False)

In [38]:
obj2.fit_transform(x2_train[[' workclass',' occupation']])

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [39]:
pd.DataFrame(obj2.fit_transform(x2_train[[' workclass',' occupation']]))

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,14,15,16,17,18,19,20,21,22,23
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26043,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
26044,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
26045,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26046,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


⚠️ While the one-hot encoded output is technically correct, its default numeric column names make it visually unappealing and difficult to interpret. Assigning meaningful column names will significantly improve readability and understanding.

In [41]:
obj2.get_feature_names_out()

array([' workclass_ ?', ' workclass_ Federal-gov',
       ' workclass_ Local-gov', ' workclass_ Never-worked',
       ' workclass_ Private', ' workclass_ Self-emp-inc',
       ' workclass_ Self-emp-not-inc', ' workclass_ State-gov',
       ' workclass_ Without-pay', ' occupation_ ?',
       ' occupation_ Adm-clerical', ' occupation_ Armed-Forces',
       ' occupation_ Craft-repair', ' occupation_ Exec-managerial',
       ' occupation_ Farming-fishing', ' occupation_ Handlers-cleaners',
       ' occupation_ Machine-op-inspct', ' occupation_ Other-service',
       ' occupation_ Priv-house-serv', ' occupation_ Prof-specialty',
       ' occupation_ Protective-serv', ' occupation_ Sales',
       ' occupation_ Tech-support', ' occupation_ Transport-moving'],
      dtype=object)

In [42]:
pd.set_option('display.max_columns',None)

In [43]:
pd.DataFrame(obj2.fit_transform(x2_train[[' workclass',' occupation']]), columns=obj2.get_feature_names_out())

Unnamed: 0,workclass_ ?,workclass_ Federal-gov,workclass_ Local-gov,workclass_ Never-worked,workclass_ Private,workclass_ Self-emp-inc,workclass_ Self-emp-not-inc,workclass_ State-gov,workclass_ Without-pay,occupation_ ?,occupation_ Adm-clerical,occupation_ Armed-Forces,occupation_ Craft-repair,occupation_ Exec-managerial,occupation_ Farming-fishing,occupation_ Handlers-cleaners,occupation_ Machine-op-inspct,occupation_ Other-service,occupation_ Priv-house-serv,occupation_ Prof-specialty,occupation_ Protective-serv,occupation_ Sales,occupation_ Tech-support,occupation_ Transport-moving
0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26043,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
26044,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
26045,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
26046,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [45]:
# collect the columns stored with object dtype (strings), i.e. the categorical features
categorical_features = x2_train.select_dtypes(include='object').columns.tolist()

In [46]:
categorical_features

[' workclass',
 ' education',
 ' marital-status',
 ' occupation',
 ' relationship',
 ' race',
 ' sex',
 ' native-country']

In [47]:
x2_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 26048 entries, 11931 to 11590
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              26048 non-null  int64 
 1    workclass       26048 non-null  object
 2    fnlwgt          26048 non-null  int64 
 3    education       26048 non-null  object
 4    education-num   26048 non-null  int64 
 5    marital-status  26048 non-null  object
 6    occupation      26048 non-null  object
 7    relationship    26048 non-null  object
 8    race            26048 non-null  object
 9    sex             26048 non-null  object
 10   capital-gain    26048 non-null  int64 
 11   capital-loss    26048 non-null  int64 
 12   hours-per-week  26048 non-null  int64 
 13   native-country  26048 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


In [48]:
obj3 = OneHotEncoder(sparse_output=False)
obj3.fit_transform(x2_train[categorical_features])

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [1., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [49]:
pd.DataFrame(obj3.fit_transform(x2_train[categorical_features])).shape

(26048, 102)

<div align="justify">

One-hot encoding has significantly expanded the feature space: the 8 categorical features become 102 binary columns, which increases dimensionality and can affect model complexity.

While this allows machine learning models to work with categorical data, it also introduces a potential issue known as the **dummy variable trap**. This trap occurs when **all dummy variables are included, leading to perfect multicollinearity**: the dummy columns of a feature always sum to 1, so in a model with an intercept any one column is a linear combination of the others. This can confuse linear models like Linear Regression, causing unstable or misleading coefficient estimates.

To prevent this, we often **drop one of the dummy columns** using `drop='first'` in `OneHotEncoder`, or `drop_first=True` in `pd.get_dummies()`. This makes the dropped category **the baseline**, and all other categories are interpreted relative to it, maintaining model stability.

Additionally, when working with **binary categorical features** (i.e., features with only two categories), encoding both values may be redundant. In such cases, setting `OneHotEncoder(drop='if_binary')` ensures that only one column is created for binary variables, further reducing the feature space and avoiding unnecessary multicollinearity.

</div>

In [50]:
obj4 = OneHotEncoder(sparse_output=False, drop='first')
obj4.fit_transform(x2_train[categorical_features])

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 1., 0., 0.]])

In [51]:
pd.DataFrame(obj4.fit_transform(x2_train[categorical_features])).shape

(26048, 94)

In [52]:
len(categorical_features)

8

By setting `drop='first'`, the first category of each of the 8 categorical features is dropped to avoid the dummy variable trap. As a result, the total number of columns falls from 102 to 94 (102 - 8 = 94).