# Label Encoder

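**Label Encoder:**
Label Encoder converts categorical labels into numerical values by assigning each unique category a unique integer. scikit-learn's `LabelEncoder` is designed for target labels and assigns the integers in sorted (alphabetical) order of the classes, so the codes reflect a meaningful order only when that order matches the alphabetical one.

**Example:**
Suppose you have a dataset with a "Size" feature containing the categories "Small," "Medium," and "Large." A label encoding that preserves the natural order maps them to integers as follows:

- Small: 0
- Medium: 1
- Large: 2

The "Size" feature is now numerical, so machine learning algorithms can work with it. Note that scikit-learn's `LabelEncoder` does not produce this mapping by default: because it sorts the classes, it yields Large: 0, Medium: 1, Small: 2.

- Label Encoding is suitable for **ordinal** data only when the assigned integers follow the intended order; it is most commonly applied to target labels and binary categories.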

In [1]:
from sklearn.preprocessing import LabelEncoder

# Sample data with a categorical feature
data = ['cat', 'dog', 'fish', 'dog', 'cat']

# Initialize the LabelEncoder
label_encoder = LabelEncoder()

# Fit and transform the data
encoded_data = label_encoder.fit_transform(data)

# Display the original data and the encoded data
print("Original Data:", data)
print("Encoded Data:", encoded_data)
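A quick check of how scikit-learn's `LabelEncoder` assigns integers to the "Size" categories: it sorts the classes, so the codes follow alphabetical order rather than Small < Medium < Large. This is a minimal sketch; the `sizes` list and the `mapping` dictionary are illustrative names, not part of the notebook's dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative sample of the "Size" feature
sizes = ["Small", "Medium", "Large", "Medium"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sizes)

# classes_ holds the categories in sorted order; a category's position is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
print(list(encoder.inverse_transform(codes)))
```

Keeping the fitted encoder around makes it possible to translate codes back into labels later.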

# Ordinal Encoder

**Ordinal Encoder:**
Ordinal Encoder maps each category of a feature to an integer and, unlike Label Encoder, lets you specify the order of the categories explicitly (in scikit-learn, through the `categories` parameter). This is useful when you have categorical data with an inherent order and you want to control the mapping.

**Example:**
Consider an "Education Level" feature with categories "High School," "Bachelor's," "Master's," and "Ph.D." Using Ordinal Encoding, you can define the order and map these categories as follows:

- High School: 0
- Bachelor's: 1
- Master's: 2
- Ph.D.: 3

In this case, you are explicitly specifying the order of the education levels.

- Ordinal Encoding is suitable for **ordinal** data

In [None]:
from sklearn.preprocessing import OrdinalEncoder

# Sample data with an ordered categorical feature
data = [['Cold'], ['Warm'], ['Hot'], ['Cold'], ['Hot']]

# Define the order of the categories
category_order = [['Cold', 'Warm', 'Hot']]

# Initialize the OrdinalEncoder with the category order
ordinal_encoder = OrdinalEncoder(categories=category_order)

# Fit and transform the data
encoded_data = ordinal_encoder.fit_transform(data)

# Display the original data and the encoded data
print("Original Data:", data)
print("Encoded Data:", encoded_data)
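The "Education Level" example above can be sketched the same way on a pandas column. The frame `df_edu` and its column names are hypothetical, chosen only for illustration; the category order is the one listed in the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Order taken from the example: High School < Bachelor's < Master's < Ph.D.
levels = ["High School", "Bachelor's", "Master's", "Ph.D."]

# Hypothetical toy frame
df_edu = pd.DataFrame(
    {"education": ["Master's", "High School", "Ph.D.", "Bachelor's"]}
)

# The encoder expects a 2D input and one category list per column
encoder = OrdinalEncoder(categories=[levels])
df_edu["education_code"] = encoder.fit_transform(df_edu[["education"]]).ravel()

print(df_edu)
```

The encoded values come back as floats, so cast them with `astype(int)` if integer codes are needed.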

# One-Hot Encoder

**One-Hot Encoder:**
One-hot encoding is used when there is no inherent order in the categories, and you want to create binary columns for each category. Each category is represented as a binary vector where only one element is "1," and the rest are "0."

**Example:**
Let's consider a "Color" feature with categories "Red," "Blue," and "Green." Using One-Hot Encoding, you create three binary columns:

- Red: [1, 0, 0]
- Blue: [0, 1, 0]
- Green: [0, 0, 1]

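Each row in your dataset receives exactly one of these binary vectors, indicating the color of that data point. The column order shown here is illustrative; scikit-learn's `OneHotEncoder` orders the columns by sorted category (Blue, Green, Red).

- One-hot encoding is typically used for **nominal** data without any inherent order.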

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Sample data with a categorical feature
data = [['Red'], ['Blue'], ['Green'], ['Green'], ['Red']]

# Initialize the OneHotEncoder
one_hot_encoder = OneHotEncoder()

# Fit and transform the data
encoded_data = one_hot_encoder.fit_transform(data)

# Convert the sparse matrix to a dense array for better visualization
encoded_data_array = encoded_data.toarray()

# Display the original data and the encoded data
print("Original Data:", data)
print("Encoded Data:")
print(encoded_data_array)

# Import Necessary Libraries

In [62]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

In [63]:
df = pd.read_csv("../data/CleanData3.csv", index_col="Unnamed: 0")

In [64]:
df_cat = df.select_dtypes("object").drop(columns="CarName")
df_cat

Unnamed: 0,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,enginetype,cylindernumber,fuelsystem
0,gas,std,two,convertible,rwd,front,dohc,four,mpfi
1,gas,std,two,convertible,rwd,front,dohc,four,mpfi
2,gas,std,two,hatchback,rwd,front,ohcv,six,mpfi
3,gas,std,four,sedan,fwd,front,ohc,four,mpfi
4,gas,std,four,sedan,4wd,front,ohc,five,mpfi
...,...,...,...,...,...,...,...,...,...
200,gas,std,four,sedan,rwd,front,ohc,four,mpfi
201,gas,turbo,four,sedan,rwd,front,ohc,four,mpfi
202,gas,std,four,sedan,rwd,front,ohcv,six,mpfi
203,diesel,turbo,four,sedan,rwd,front,ohc,six,idi


In [65]:
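# Split the categorical columns by how many distinct values they hold
binary = []
multiclass = []

for col in df_cat.columns:
    # More than two distinct values: the column will be one-hot encoded
    if df_cat[col].nunique() > 2:
        multiclass.append(col)
    else:
        binary.append(col)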

In [66]:
binary

['fueltype', 'aspiration', 'doornumber', 'enginelocation']

In [67]:
multiclass

['carbody', 'drivewheel', 'enginetype', 'cylindernumber', 'fuelsystem']

In [68]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Binarizer, OrdinalEncoder

In [69]:
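def BinaryEncoder(col):
    """Label-encode a two-category column of df_cat in place (values become 0/1)."""
    global df_cat
    encoder = LabelEncoder()
    # Assign the column directly so it takes the integer dtype of the encoded values
    df_cat[col] = encoder.fit_transform(df_cat[col])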

In [70]:
for col in binary:
    BinaryEncoder(col)
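Since the encoder is discarded after each call, the meaning of 0 and 1 in each binary column is not recorded anywhere. One possible way to keep it, sketched here on a toy frame (`toy`, `encoders`, and `mappings` are hypothetical names, and the rows are made up), is to store one fitted encoder per column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with two binary columns named like those in the dataset
toy = pd.DataFrame(
    {
        "fueltype": ["gas", "diesel", "gas"],
        "aspiration": ["std", "turbo", "std"],
    }
)

encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Category -> code lookup for every encoded column
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

With the classes sorted alphabetically, `diesel` maps to 0 and `gas` to 1, which matches the `fueltype` values in the encoded output.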

In [71]:
df_cat

Unnamed: 0,fueltype,aspiration,doornumber,carbody,drivewheel,enginelocation,enginetype,cylindernumber,fuelsystem
0,1,0,1,convertible,rwd,0,dohc,four,mpfi
1,1,0,1,convertible,rwd,0,dohc,four,mpfi
2,1,0,1,hatchback,rwd,0,ohcv,six,mpfi
3,1,0,0,sedan,fwd,0,ohc,four,mpfi
4,1,0,0,sedan,4wd,0,ohc,five,mpfi
...,...,...,...,...,...,...,...,...,...
200,1,0,0,sedan,rwd,0,ohc,four,mpfi
201,1,1,0,sedan,rwd,0,ohc,four,mpfi
202,1,0,0,sedan,rwd,0,ohcv,six,mpfi
203,0,1,0,sedan,rwd,0,ohc,six,idi


In [72]:
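def multiclassEncoder(cols):
    """One-hot encode the given columns of df_cat and replace them with the dummy columns."""
    global df_cat
    # Use the `cols` argument (not the global list) and request integer 0/1 dummies directly
    encodings = pd.get_dummies(df_cat[cols], dtype=int)
    df_cat.drop(columns=cols, inplace=True)
    df_cat = pd.concat([df_cat, encodings], axis=1)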

In [73]:
multiclassEncoder(multiclass)
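What `pd.get_dummies` produces for a single multiclass column can be seen on a toy frame. This is a minimal sketch with made-up rows; only the `drivewheel` column name and its three categories come from the dataset.

```python
import pandas as pd

# Toy frame with one multiclass column
toy = pd.DataFrame({"drivewheel": ["rwd", "fwd", "4wd", "fwd"]})

# One 0/1 column per category, named <column>_<category>
dummies = pd.get_dummies(toy[["drivewheel"]], dtype=int)
print(dummies)
```

Each row has exactly one 1 across the new columns, which is the pattern visible in the `carbody_*`, `drivewheel_*`, and `fuelsystem_*` columns of the encoded frame.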

In [74]:
df_cat

Unnamed: 0,fueltype,aspiration,doornumber,enginelocation,carbody_convertible,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon,drivewheel_4wd,...,cylindernumber_twelve,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
1,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,1,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
201,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
202,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,0,0
203,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,1,0,0,0,0


In [75]:
df_cat['car_ID'] = df.car_ID
df_cat

Unnamed: 0,fueltype,aspiration,doornumber,enginelocation,carbody_convertible,carbody_hardtop,carbody_hatchback,carbody_sedan,carbody_wagon,drivewheel_4wd,...,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi,car_ID
0,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,1,0,1,0,1,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,2
2,1,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,1,0,0,3
3,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,4
4,1,0,0,0,0,0,0,1,0,1,...,0,0,0,0,0,0,1,0,0,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,201
201,1,1,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,202
202,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,1,0,0,203
203,0,1,0,0,0,0,0,1,0,0,...,0,0,0,0,1,0,0,0,0,204


In [76]:
df.drop(columns=df.select_dtypes('object').columns, inplace=True)

In [77]:
df

Unnamed: 0,car_ID,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,compressionratio,horsepower,peakrpm,citympg,highwaympg,price
0,1,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,13495.0
1,2,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111,5000,21,27,16500.0
2,3,1,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154,5000,19,26,16500.0
3,4,2,99.8,176.6,66.2,54.3,2337,109,3.19,3.40,10.0,102,5500,24,30,13950.0
4,5,2,99.4,176.6,66.4,54.3,2824,136,3.19,3.40,8.0,115,5500,18,22,17450.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
200,201,-1,109.1,188.8,68.9,55.5,2952,141,3.78,3.15,9.5,114,5400,23,28,16845.0
201,202,-1,109.1,188.8,68.8,55.5,3049,141,3.78,3.15,8.7,160,5300,19,25,19045.0
202,203,-1,109.1,188.8,68.9,55.5,3012,173,3.58,2.87,8.8,134,5500,18,23,21485.0
203,204,-1,109.1,188.8,68.9,55.5,3217,145,3.01,3.40,23.0,106,4800,26,27,22470.0


In [79]:
df = df.merge(df_cat, on='car_ID', how='inner')
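An inner merge silently drops rows whose key is missing on either side and duplicates rows whose key repeats. A hedged sketch of guarding against both, using toy frames (`numeric` and `encoded` are hypothetical stand-ins for `df` and `df_cat`, with made-up `fueltype` values):

```python
import pandas as pd

# Toy stand-ins for the numeric and the encoded categorical frames
numeric = pd.DataFrame({"car_ID": [1, 2, 3], "price": [13495.0, 16500.0, 16500.0]})
encoded = pd.DataFrame({"car_ID": [1, 2, 3], "fueltype": [1, 1, 0]})

# validate raises pandas.errors.MergeError if car_ID repeats on either side
merged = numeric.merge(encoded, on="car_ID", how="inner", validate="one_to_one")

# An inner join should keep every row when the keys match exactly
assert len(merged) == len(numeric)
print(merged)
```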

In [80]:
df.head()

Unnamed: 0,car_ID,symboling,wheelbase,carlength,carwidth,carheight,curbweight,enginesize,boreratio,stroke,...,cylindernumber_twelve,cylindernumber_two,fuelsystem_1bbl,fuelsystem_2bbl,fuelsystem_4bbl,fuelsystem_idi,fuelsystem_mfi,fuelsystem_mpfi,fuelsystem_spdi,fuelsystem_spfi
0,1,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,0,0,1,0,0
1,2,3,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,...,0,0,0,0,0,0,0,1,0,0
2,3,1,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,...,0,0,0,0,0,0,0,1,0,0
3,4,2,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,...,0,0,0,0,0,0,0,1,0,0
4,5,2,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,...,0,0,0,0,0,0,0,1,0,0


In [None]:
df.to_csv('../data/encodedData.csv')
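Writing with the default settings stores the index as an unnamed first column, which is why a file saved this way shows up with an `Unnamed: 0` column when read back. A small round-trip sketch on an in-memory buffer (the toy frame is illustrative, not the real data):

```python
import io

import pandas as pd

toy = pd.DataFrame({"car_ID": [1, 2], "fueltype": [1, 0]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without index_col exposes the saved index as "Unnamed: 0"
buffer.seek(0)
raw = pd.read_csv(buffer)

# Reading with index_col=0 restores it as the index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(raw.columns))
print(list(restored.columns))
```

Passing `index=False` to `to_csv` is an alternative when the index carries no information.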