## Encoding Categorical Variables

Encoding categorical variables is a crucial step in preparing data for machine learning. Most algorithms operate only on numerical input, so categorical variables must be converted to a numerical format before a model is trained.

### Qualitative Data (Categorical Data)

Qualitative data describes categories or qualities rather than numerical values. It answers questions like "what kind" or "which category." It is divided into two types:

* A. Nominal data: categories with no inherent order (one-hot encoding)
* B. Ordinal data: categories with a meaningful order (ordinal encoding for features, label encoding for target labels)

### Why Encode Categorical Data?

Many datasets include categorical variables (features or labels that represent categories, not numerical values) such as:

* Gender (Male, Female)
* Country (USA, UK, China)
* Product Categories (Electronics, Clothing, Furniture)
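
Before choosing an encoding, it helps to see which columns are categorical at all. The following is a minimal sketch on a hypothetical toy DataFrame (the `df_demo` name and the `Age` column are assumptions made for illustration); it simply treats every non-numeric column as a candidate for encoding.

```python
import pandas as pd

# Hypothetical toy dataset; column names mirror the examples listed above
df_demo = pd.DataFrame({
    "Gender": ["Male", "Female", "Female"],
    "Country": ["USA", "UK", "China"],
    "Age": [34, 28, 45],
})

# Every non-numeric column is treated as a candidate for encoding
categorical_cols = df_demo.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)  # ['Gender', 'Country']
```

Numeric columns can also be categorical in meaning (for example, a zip code), so a dtype check like this is only a starting point.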

**Types of Encoding**

* Label Encoding
* Ordinal Encoding
* One-Hot Encoding

**1. Label Encoding**

* LabelEncoder encodes target labels with values between 0 and n_classes-1.
* It should be used to encode the target values, i.e. y, and not the input features X.

In [1]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder

data = {'Country': ['USA', 'Canada', 'Mexico', 'Canada', 'USA']}

df = pd.DataFrame(data)

# Initialize LabelEncoder
label_encoder = LabelEncoder()

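# Apply label encoding (classes are sorted alphabetically: Canada=0, Mexico=1, USA=2).
# Country is used here purely for illustration; LabelEncoder is intended for the target y.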
df['Country_Encoded'] = label_encoder.fit_transform(df['Country'])

print(df)

  Country  Country_Encoded
0     USA                2
1  Canada                0
2  Mexico                1
3  Canada                0
4     USA                2
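
The integer codes only make sense together with the mapping that produced them. As a small sketch (the variable names `labels`, `encoder`, and `codes` are chosen for illustration), the fitted encoder exposes that mapping through `classes_` and can reverse it with `inverse_transform`:

```python
from sklearn.preprocessing import LabelEncoder

labels = ["USA", "Canada", "Mexico", "Canada", "USA"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its code
print(encoder.classes_.tolist())  # ['Canada', 'Mexico', 'USA']
print(codes.tolist())             # [2, 0, 1, 0, 2]

# inverse_transform maps the integer codes back to the original labels
decoded = encoder.inverse_transform(codes)
print(decoded.tolist())           # ['USA', 'Canada', 'Mexico', 'Canada', 'USA']
```

This is useful after prediction, when a model returns integer classes that need to be reported as readable labels.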


**2. Ordinal Encoding**

Ordinal categorical variables have a clear, meaningful order or ranking among their categories, but the intervals between the ranks may not be equal or measurable. In ordinal encoding, each category is assigned a unique integer value based on its order, preserving the ranking information.

* Ordinal encoding should only be applied to **independent features (X)** and not the **target (Y)** variable.

In [2]:
from sklearn.preprocessing import OrdinalEncoder

data = {'Satisfaction_Level': ['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied',
                               'Neutral', 'Very Satisfied', 'Unsatisfied', 'Very Unsatisfied']}

df = pd.DataFrame(data)

satisfaction_order = ['Very Unsatisfied', 'Unsatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']

# Initialize OrdinalEncoder
encoder = OrdinalEncoder(categories=[satisfaction_order])

# Apply ordinal encoding
df['Satisfaction_Encoded'] = encoder.fit_transform(df[['Satisfaction_Level']])

print(df)

  Satisfaction_Level  Satisfaction_Encoded
0   Very Unsatisfied                   0.0
1        Unsatisfied                   1.0
2            Neutral                   2.0
3          Satisfied                   3.0
4     Very Satisfied                   4.0
5            Neutral                   2.0
6     Very Satisfied                   4.0
7        Unsatisfied                   1.0
8   Very Unsatisfied                   0.0
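
A fitted OrdinalEncoder raises an error by default when it meets a category it was not given. One way to handle that, sketched here under the assumption that new data may contain a value outside the defined order (the "No Opinion" value and the `train`/`new` frames are hypothetical), is to map unknown categories to a sentinel code:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

order = ["Very Unsatisfied", "Unsatisfied", "Neutral", "Satisfied", "Very Satisfied"]

train = pd.DataFrame({"Satisfaction_Level": ["Neutral", "Satisfied", "Very Unsatisfied"]})
# "No Opinion" is a hypothetical value that is not part of the defined order
new = pd.DataFrame({"Satisfaction_Level": ["Satisfied", "No Opinion"]})

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(
    categories=[order],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoder.fit(train)

encoded_new = encoder.transform(new).ravel()
print(encoded_new.tolist())  # [3.0, -1.0]
```

Whether -1 is a sensible sentinel depends on the downstream model; for some models, treating unknowns as missing values may be a better fit.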


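#### Ordinal vs Label

Use OrdinalEncoder for ordered input features (X), where you supply the category order yourself, and LabelEncoder for the target (y), where the integer codes act only as class identifiers.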

In [3]:
data = {'Education': ['High School', 'Bachelor', 'Master', 'PhD'],
        'Target': ['Employed', 'Unemployed', 'Employed', 'Unemployed']}

df = pd.DataFrame(data)

# Ordinal encoding for independent feature (X)
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
ordinal_encoder = OrdinalEncoder(categories=[education_order])
df['Education_Encoded'] = ordinal_encoder.fit_transform(df[['Education']])

# Label encoding for target (Y)
label_encoder = LabelEncoder()
df['Target_Encoded'] = label_encoder.fit_transform(df['Target'])

print(df)

     Education      Target  Education_Encoded  Target_Encoded
0  High School    Employed                0.0               0
1     Bachelor  Unemployed                1.0               1
2       Master    Employed                2.0               0
3          PhD  Unemployed                3.0               1
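
Passing the category order explicitly matters. As a short sketch of what happens otherwise (the `df_edu` name is chosen for illustration), an OrdinalEncoder without the `categories` argument sorts the values alphabetically, which here puts Bachelor below High School:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_edu = pd.DataFrame({"Education": ["High School", "Bachelor", "Master", "PhD"]})

# Without categories=..., the unique values are sorted alphabetically
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(df_edu[["Education"]]).ravel()
print(default_codes.tolist())  # [1.0, 0.0, 2.0, 3.0]

# With an explicit order, the codes follow the real ranking
ordered_encoder = OrdinalEncoder(categories=[["High School", "Bachelor", "Master", "PhD"]])
ordered_codes = ordered_encoder.fit_transform(df_edu[["Education"]]).ravel()
print(ordered_codes.tolist())  # [0.0, 1.0, 2.0, 3.0]
```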


**3. One-Hot Encoding**

The basic idea is to create a new binary column for each possible category. Each row gets 1 in the column that matches its category and 0 in all other columns.

**Why use One-Hot Encoding?**

* Handling categorical data: Machine learning algorithms typically require numerical data. One-hot encoding allows us to use categorical variables without introducing any false relationships.
* Avoid ordinal relationship: One-hot encoding helps prevent models from incorrectly assuming that higher numerical values (from label encoding) represent a higher rank or importance.

**Ways to Use One-Hot Encoding**

**1. Using Pandas' get_dummies()**

Pandas provides a built-in function for one-hot encoding, get_dummies().

In [16]:
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Blue']}
df = pd.DataFrame(data)

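# Apply one-hot encoding using Pandas. drop_first=True drops the first category
# (Blue, alphabetically); a row with 0 in every remaining column is Blue.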
one_hot_df = pd.get_dummies(df, columns=['Color'], drop_first=True)

# Ensure that the values are converted to integers (1/0)
one_hot_df = one_hot_df.astype(int)

print(one_hot_df)

   Color_Green  Color_Red
0            0          1
1            1          0
2            0          0
3            0          1
4            0          0


In [17]:
one_hot_df.shape

(5, 2)
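
The shape above has two columns rather than three because of `drop_first=True`. The sketch below (the names `full`, `reduced`, and `is_blue` are chosen for illustration) compares both variants on the same data and shows that the dropped category can still be recovered:

```python
import pandas as pd

df_color = pd.DataFrame({"Color": ["Red", "Green", "Blue", "Red", "Blue"]})

# One column per category
full = pd.get_dummies(df_color, columns=["Color"], dtype=int)
# One column fewer: the first category (Blue) is dropped
reduced = pd.get_dummies(df_color, columns=["Color"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['Color_Blue', 'Color_Green', 'Color_Red']
print(reduced.columns.tolist())  # ['Color_Green', 'Color_Red']

# Rows where every remaining column is 0 belong to the dropped category
is_blue = reduced.sum(axis=1) == 0
print(is_blue.tolist())          # [False, False, True, False, True]
```

Dropping one column removes the redundancy among the dummy columns, which mainly matters for linear models; tree-based models are usually unaffected either way.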

**2. Using OneHotEncoder from Scikit-learn**

In [51]:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

file_path = 'C:/Users/hp/Desktop/Machine Learning/Datasets/cars.csv'
df = pd.read_csv(file_path)

In [52]:
X = df.iloc[:, 0:4]  # first four columns: brand, km_driven, fuel, owner
y = df.iloc[:, -1]   # last column is the target

In [53]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

In [54]:
X_train.head()

| | brand | km_driven | fuel | owner |
|---|---|---|---|---|
| 5571 | Hyundai | 35000 | Diesel | First Owner |
| 2038 | Jeep | 60000 | Diesel | First Owner |
| 2957 | Hyundai | 25000 | Petrol | First Owner |
| 7618 | Mahindra | 130000 | Diesel | Second Owner |
| 6684 | Hyundai | 155000 | Diesel | First Owner |


In [55]:
print("Training features shape:", X_train.shape)
print("Testing features shape:", X_test.shape)

Training features shape: (6502, 4)
Testing features shape: (1626, 4)


In [56]:
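# sparse_output replaces the older sparse argument (renamed in scikit-learn 1.2, removed in 1.4)
ohe = OneHotEncoder(drop='first', sparse_output=False, dtype=np.int32)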

In [57]:
X_train_encoded = ohe.fit_transform(X_train[['fuel', 'owner']])
X_test_encoded = ohe.transform(X_test[['fuel', 'owner']])



In [58]:
print("Encoded training features shape:", X_train_encoded.shape)
print("Encoded testing features shape:", X_test_encoded.shape)

Encoded training features shape: (6502, 7)
Encoded testing features shape: (1626, 7)


In [59]:
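# Stack the untouched columns (brand, km_driven) next to the one-hot columns.
# brand is still a string column, so the result is an object array.
X_train_final = np.hstack((X_train[['brand', 'km_driven']].values, X_train_encoded))
X_test_final = np.hstack((X_test[['brand', 'km_driven']].values, X_test_encoded))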

In [60]:
print("Final training features shape:", X_train_final.shape)
print("Final testing features shape:", X_test_final.shape)

Final training features shape: (6502, 9)
Final testing features shape: (1626, 9)


In [61]:
print("\nCombined Training Features (first 5 rows):")
print(X_train_final[:5])


Combined Training Features (first 5 rows):
[['Hyundai' 35000 1 0 0 0 0 0 0]
 ['Jeep' 60000 1 0 0 0 0 0 0]
 ['Hyundai' 25000 0 0 1 0 0 0 0]
 ['Mahindra' 130000 1 0 0 0 1 0 0]
 ['Hyundai' 155000 1 0 0 0 0 0 0]]
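
The manual split, encode, and `np.hstack` steps above can also be expressed with scikit-learn's ColumnTransformer. This is an alternative sketch, not the approach used in this notebook; the `cars` frame is a hypothetical stand-in for the dataset, and the brand column is left out so the result stays numeric.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for a few rows of the cars dataset
cars = pd.DataFrame({
    "km_driven": [35000, 60000, 25000, 130000],
    "fuel": ["Diesel", "Diesel", "Petrol", "Diesel"],
    "owner": ["First Owner", "First Owner", "First Owner", "Second Owner"],
})

transformer = ColumnTransformer(
    transformers=[("ohe", OneHotEncoder(drop="first"), ["fuel", "owner"])],
    remainder="passthrough",  # keep km_driven unchanged
    sparse_threshold=0,       # always return a dense array
)

# Output columns: one-hot columns first, passthrough columns last
combined = transformer.fit_transform(cars)
print(combined.shape)  # (4, 3)
```

With a train/test split, the same fitted transformer would be applied to the test set with `transform`, mirroring the `fit_transform`/`transform` pair used for the encoder above.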


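**One-Hot Encoding with Top Categories**

When a column has many distinct values, one-hot encoding every category produces a very wide matrix. A common workaround is to keep the frequent categories and group the rare ones into a single bucket.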

In [47]:
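# Count how many rows each brand has
counts = df['brand'].value_counts()

In [48]:
df['brand'].nunique()  # number of distinct brands
threshold = 100  # brands with at most this many rows are treated as rare

In [49]:
# Brands to be grouped together
repl = counts[counts <= threshold].index

In [50]:
# Replace rare brands with 'uncommon', one-hot encode, and show 5 random rows
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)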

| | BMW | Chevrolet | Ford | Honda | Hyundai | Mahindra | Maruti | Renault | Skoda | Tata | Toyota | Volkswagen | uncommon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5599 | False | False | False | False | False | False | False | False | False | False | False | True | False |
| 6695 | False | False | False | False | False | False | True | False | False | False | False | False | False |
| 222 | False | False | False | False | False | False | False | False | False | False | False | False | True |
| 4829 | True | False | False | False | False | False | False | False | False | False | False | False | False |
| 318 | False | False | False | False | False | False | False | False | False | False | True | False | False |
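
The same thresholding idea can be tried without the cars dataset. The sketch below uses a made-up `brands` Series and a hypothetical threshold of 2 so that the grouping is easy to verify by eye:

```python
import pandas as pd

# Made-up brand column: two frequent brands and two rare ones
brands = pd.Series(["Maruti"] * 4 + ["Hyundai"] * 3 + ["Jeep", "Audi"], name="brand")
threshold = 2  # hypothetical cut-off: brands with at most 2 rows are rare

counts = brands.value_counts()
rare = counts[counts <= threshold].index.tolist()

# Replace every rare brand with a single 'uncommon' label, then one-hot encode
grouped = brands.replace(rare, "uncommon")
dummies = pd.get_dummies(grouped, dtype=int)

print(dummies.columns.tolist())       # ['Hyundai', 'Maruti', 'uncommon']
print(int(dummies["uncommon"].sum())) # 2
```

The threshold is a tuning choice: a higher value gives fewer columns but hides more brand-level detail inside the 'uncommon' bucket.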
