
## 2. Encoding Categorical Variables in a Car Evaluation Dataset
### <b>Task:</b> Encode categorical variables in the Car Evaluation dataset using one-hot encoding and label encoding. Compare the results.

In [2]:
# Importing Libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder

In [3]:
# Loading the Car Evaluation Dataset
carEvaluation_dataset = pd.read_csv('Datasets/CarEvaluation.csv', header=None)  # forward slash works on Windows, Linux and macOS

print(carEvaluation_dataset.shape, '\n')
carEvaluation_dataset.head()

(1728, 7) 



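|  | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |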


In [4]:
# Naming the columns
carEvaluation_dataset.columns = ['Buying', 'Maint', 'Doors', 'Persons', 'Lug_boot', 'Safety', 'Class']
carEvaluation_dataset.head()

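|  | Buying | Maint | Doors | Persons | Lug_boot | Safety | Class |
|---|---|---|---|---|---|---|---|
| 0 | vhigh | vhigh | 2 | 2 | small | low | unacc |
| 1 | vhigh | vhigh | 2 | 2 | small | med | unacc |
| 2 | vhigh | vhigh | 2 | 2 | small | high | unacc |
| 3 | vhigh | vhigh | 2 | 2 | med | low | unacc |
| 4 | vhigh | vhigh | 2 | 2 | med | med | unacc |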
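
The loading and renaming steps above can also be combined by passing the column names to `read_csv` directly. The following is a minimal sketch, not the notebook's own code: it reads three sample rows from an in-memory string instead of the real CSV file, and the names `sample_csv`, `column_names`, and `sample` are chosen here for illustration only.

```python
from io import StringIO

import pandas as pd

# Three sample rows copied from the output above; the notebook itself reads the CSV file from disk
sample_csv = StringIO(
    "vhigh,vhigh,2,2,small,low,unacc\n"
    "vhigh,vhigh,2,2,small,med,unacc\n"
    "vhigh,vhigh,2,2,small,high,unacc\n"
)

column_names = ['Buying', 'Maint', 'Doors', 'Persons', 'Lug_boot', 'Safety', 'Class']

# names= assigns the headers at load time.
# dtype=str keeps Doors and Persons as strings; in the full file they contain '5more' and 'more'.
sample = pd.read_csv(sample_csv, header=None, names=column_names, dtype=str)

print(sample.shape)
print(sample.head())
```

Without `dtype=str`, this small sample would parse `Doors` and `Persons` as integers, because the string levels only appear further down in the full file.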


In [5]:
carEvaluation_dataset.tail()

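|  | Buying | Maint | Doors | Persons | Lug_boot | Safety | Class |
|---|---|---|---|---|---|---|---|
| 1723 | low | low | 5more | more | med | med | good |
| 1724 | low | low | 5more | more | med | high | vgood |
| 1725 | low | low | 5more | more | big | low | unacc |
| 1726 | low | low | 5more | more | big | med | good |
| 1727 | low | low | 5more | more | big | high | vgood |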


In [6]:
# Checking for null values
carEvaluation_dataset.isnull().sum()

Buying      0
Maint       0
Doors       0
Persons     0
Lug_boot    0
Safety      0
Class       0
dtype: int64

-> So, there are no missing values in the dataset. We can now proceed to encoding the categorical variables.

In [7]:
# Printing datatypes of features
carEvaluation_dataset.dtypes

Buying      object
Maint       object
Doors       object
Persons     object
Lug_boot    object
Safety      object
Class       object
dtype: object

-> So, all seven variables are categorical (dtype = object), and each of them has to be encoded.

In [8]:
# Printing the categories in each categorical column
for column in carEvaluation_dataset.columns:
    print(f"{column}: {carEvaluation_dataset[column].nunique()} categories: {carEvaluation_dataset[column].unique()}")

Buying: 4 categories: ['vhigh' 'high' 'med' 'low']
Maint: 4 categories: ['vhigh' 'high' 'med' 'low']
Doors: 4 categories: ['2' '3' '4' '5more']
Persons: 3 categories: ['2' '4' 'more']
Lug_boot: 3 categories: ['small' 'med' 'big']
Safety: 3 categories: ['low' 'med' 'high']
Class: 4 categories: ['unacc' 'acc' 'vgood' 'good']


### <li>One Hot Encoding</li>

In [9]:
# One Hot Encoding
one_hot_encoded_dataset = pd.get_dummies(carEvaluation_dataset)
one_hot_encoded_dataset = one_hot_encoded_dataset.astype(int)  # To convert it to 0/1 values

print(one_hot_encoded_dataset.shape, '\n')
one_hot_encoded_dataset.head()

(1728, 25) 



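|  | Buying_high | Buying_low | Buying_med | Buying_vhigh | Maint_high | Maint_low | Maint_med | Maint_vhigh | Doors_2 | Doors_3 | ... | Lug_boot_big | Lug_boot_med | Lug_boot_small | Safety_high | Safety_low | Safety_med | Class_acc | Class_good | Class_unacc | Class_vgood |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |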


In [10]:
# Printing datatypes of one hot encoded dataset
one_hot_encoded_dataset.dtypes

Buying_high       int64
Buying_low        int64
Buying_med        int64
Buying_vhigh      int64
Maint_high        int64
Maint_low         int64
Maint_med         int64
Maint_vhigh       int64
Doors_2           int64
Doors_3           int64
Doors_4           int64
Doors_5more       int64
Persons_2         int64
Persons_4         int64
Persons_more      int64
Lug_boot_big      int64
Lug_boot_med      int64
Lug_boot_small    int64
Safety_high       int64
Safety_low        int64
Safety_med        int64
Class_acc         int64
Class_good        int64
Class_unacc       int64
Class_vgood       int64
dtype: object

-> So, each object column is converted into as many binary columns as it has categories (4 + 4 + 4 + 3 + 3 + 3 + 4 = 25 in total), and every new column now has an integer datatype.
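
The cell above one-hot encodes every column, including `Class`. If `Class` is used as the prediction target, it is common to encode only the predictors and keep the target separate. The sketch below assumes that setup and uses a few made-up toy rows rather than the real file; `toy`, `features`, `target`, `full`, and `reduced` are illustrative names. It also shows two `pd.get_dummies` parameters: `dtype`, which produces integer columns directly, and `drop_first`, which drops one redundant column per feature.

```python
import pandas as pd

# Toy rows built from category values that occur in the dataset (not the real file)
toy = pd.DataFrame({
    'Safety': ['low', 'med', 'high', 'low'],
    'Lug_boot': ['small', 'med', 'big', 'big'],
    'Class': ['unacc', 'acc', 'vgood', 'unacc'],
})

# Assumption: Class is the prediction target, so only the predictors are one-hot encoded
features = toy.drop(columns='Class')
target = toy['Class']

# dtype='int64' yields 0/1 integers directly, so no separate astype(int) call is needed
full = pd.get_dummies(features, dtype='int64')

# drop_first=True removes the first (alphabetically sorted) category of each feature
reduced = pd.get_dummies(features, dtype='int64', drop_first=True)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one column per feature loses no information, because the dropped category is implied when all remaining columns of that feature are 0. This mainly matters for linear models, where the full set of dummy columns is perfectly collinear.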

### <li>Label Encoding</li>

In [11]:
# Label Encoding
label_encoder = LabelEncoder()
label_encoded_dataset = carEvaluation_dataset.copy()

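# The encoder is refit on every column; codes follow the sorted (alphabetical) order of the categories
for column in carEvaluation_dataset.columns: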
    label_encoded_dataset[column] = label_encoder.fit_transform(carEvaluation_dataset[column])

print(label_encoded_dataset.shape, '\n')
label_encoded_dataset.head()

(1728, 7) 



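|  | Buying | Maint | Doors | Persons | Lug_boot | Safety | Class |
|---|---|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0 | 2 | 1 | 2 |
| 1 | 3 | 3 | 0 | 0 | 2 | 2 | 2 |
| 2 | 3 | 3 | 0 | 0 | 2 | 0 | 2 |
| 3 | 3 | 3 | 0 | 0 | 1 | 1 | 2 |
| 4 | 3 | 3 | 0 | 0 | 1 | 2 | 2 |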


In [12]:
# Printing categories of the label encoded dataset
for column in label_encoded_dataset.columns:
    print(f"{column}: {label_encoded_dataset[column].nunique()} categories: {label_encoded_dataset[column].unique()}")
print('\n')

# Printing datatypes of label encoded dataset
label_encoded_dataset.dtypes

Buying: 4 categories: [3 0 2 1]
Maint: 4 categories: [3 0 2 1]
Doors: 4 categories: [0 1 2 3]
Persons: 3 categories: [0 1 2]
Lug_boot: 3 categories: [2 1 0]
Safety: 3 categories: [1 2 0]
Class: 4 categories: [2 0 3 1]




Buying      int64
Maint       int64
Doors       int64
Persons     int64
Lug_boot    int64
Safety      int64
Class       int64
dtype: object

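-> So, all the object columns have been converted to int columns, and the shape of the original dataframe is retained. Note that `LabelEncoder` numbers the categories in sorted (alphabetical) order, so for `Buying` the codes are high = 0, low = 1, med = 2, vhigh = 3, which does not match the natural ordering of the levels.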
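
To see where the integer codes come from, the fitted encoder's `classes_` attribute can be inspected. The sketch below does this for the four `Buying` levels and, as an alternative, uses scikit-learn's `OrdinalEncoder` with an explicit category order. The order in `price_order` (cheapest to most expensive) is an assumption made here for illustration, not something stated in the dataset file.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# The four Buying levels, in the order they first appear in the dataset
buying = pd.DataFrame({'Buying': ['vhigh', 'high', 'med', 'low']})

# LabelEncoder sorts the classes, so the codes follow alphabetical order
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(buying['Buying'])
print(label_encoder.classes_.tolist())   # ['high', 'low', 'med', 'vhigh']
print(label_codes.tolist())              # [3, 0, 2, 1]

# Assumed natural order of the price levels, from cheapest to most expensive
price_order = ['low', 'med', 'high', 'vhigh']

# OrdinalEncoder takes one category list per column and returns floats by default
ordinal_encoder = OrdinalEncoder(categories=[price_order])
ordinal_codes = ordinal_encoder.fit_transform(buying[['Buying']]).ravel()
print(ordinal_codes.tolist())            # [3.0, 2.0, 1.0, 0.0]
```

With the explicit order, a larger code always means a higher price level, which is not the case for the alphabetical codes. Keeping one fitted encoder per column also makes it possible to map codes back to labels with `inverse_transform`, which the single reused encoder in the loop above does not allow.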

### Comparing the results

In [13]:
print('One Hot Encoded Data Shape:', one_hot_encoded_dataset.shape)
print('Label Encoded Data Shape:', label_encoded_dataset.shape)

One Hot Encoded Data Shape: (1728, 25)
Label Encoded Data Shape: (1728, 7)


<p><b>One Hot Encoding: </b> The one-hot encoded dataset has shape (1728, 25), so this encoding creates new columns. Every category of every variable gets its own binary column, which raises the column count from 7 to 25 and increases the memory footprint relative to the label-encoded version. In return, it imposes no order on the categories, which makes it a good fit for algorithms that would otherwise interpret integer codes as magnitudes, such as linear or distance-based models.</p>

<p><b>Label Encoding: </b> The label-encoded dataset has shape (1728, 7), the same as the original dataset. It does not create new columns, which is the main advantage of this encoding. On the other hand, the integer codes imply an order among the categories, and that implied order may not be appropriate for every algorithm. Label encoding is generally better suited to tree-based models, which are less sensitive to the particular integer values assigned to each category.</p>
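
The size difference between the two encodings can be measured with `DataFrame.memory_usage`. The sketch below is a rough illustration under stated assumptions: instead of the real file it builds a synthetic stand-in with one row per combination of the six feature levels printed earlier (the `Class` column is left out because it cannot be rebuilt from those lists), and it uses pandas category codes as a stand-in for `LabelEncoder`. The names `levels`, `stand_in`, `one_hot`, and `label` are illustrative.

```python
from itertools import product

import pandas as pd

# Category lists as printed earlier in the notebook; Class is intentionally omitted
levels = {
    'Buying': ['vhigh', 'high', 'med', 'low'],
    'Maint': ['vhigh', 'high', 'med', 'low'],
    'Doors': ['2', '3', '4', '5more'],
    'Persons': ['2', '4', 'more'],
    'Lug_boot': ['small', 'med', 'big'],
    'Safety': ['low', 'med', 'high'],
}

# Synthetic stand-in: one row per combination of levels (4 * 4 * 4 * 3 * 3 * 3 = 1728 rows)
stand_in = pd.DataFrame(list(product(*levels.values())), columns=list(levels))

# One-hot encoding: one int64 column per category
one_hot = pd.get_dummies(stand_in, dtype='int64')

# Integer codes in sorted category order, cast to int64 to match the notebook's label-encoded dtypes
label = stand_in.apply(lambda col: col.astype('category').cat.codes.astype('int64'))

# index=False counts only the column data, so the two frames are compared like for like
one_hot_bytes = one_hot.memory_usage(index=False).sum()
label_bytes = label.memory_usage(index=False).sum()

print(one_hot.shape, one_hot_bytes)
print(label.shape, label_bytes)
```

Because both frames hold 8-byte integers, the memory ratio equals the column ratio (21 to 6 for these six features). Smaller integer types such as `int8` would shrink both frames without changing that ratio.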

<hr>