# One Hot Encoding
One-hot encoding is a technique for handling categorical data. Most machine learning algorithms expect numeric input, so categorical (textual) values have to be converted to numbers before training.<br/>
Simply assigning each category an integer (for example 0, 1, 2) is usually not a good substitute, because of the following drawbacks:
- **Arbitrary Ordering:** Numerical encoding imposes an arbitrary ordering on the categories, which may not be appropriate for variables without a clear ordinal relationship.
- **Magnitude Influence:** Some machine learning algorithms might interpret numerical encodings as having meaningful magnitudes, which might lead to incorrect interpretations.
- **Loss of Information:** Integer codes can hide the fact that the categories are distinct and unrelated, because the numerical values do not reflect any meaningful relationship between them. One-hot encoding keeps each category as its own separate 0/1 column, so nothing about the categorical variable is lost.
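
To make the drawbacks above concrete, here is a minimal sketch comparing integer codes with one-hot columns. The short `models` series is a made-up sample that reuses category names from the dataset, and using `astype("category").cat.codes` to produce the integer codes is just one possible choice.

```python
import pandas as pd

# Hypothetical sample reusing the category names from carprices.csv
models = pd.Series(["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"])

# Integer encoding: categories are sorted alphabetically and numbered 0, 1, 2.
# The resulting order (Audi < BMW < Mercedez) carries no real meaning.
codes = models.astype("category").cat.codes
print(codes.tolist())

# One-hot encoding: one 0/1 column per category, with no implied order.
one_hot = pd.get_dummies(models, dtype=int)
print(one_hot)
```

With the integer codes, a model could treat **Mercedez Benz C class** (2) as "twice" **BMW X5** (1). The one-hot columns avoid this.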

In [1]:
import pandas as pd

dataFrame = pd.read_csv("carprices.csv")
dataFrame

Unnamed: 0,Car Model,Mileage,Sell Price($),Age(yrs)
0,BMW X5,69000,18000,6
1,BMW X5,35000,34000,3
2,BMW X5,57000,26100,5
3,BMW X5,22500,40000,2
4,BMW X5,46000,31500,4
5,Audi A5,59000,29400,5
6,Audi A5,52000,32000,5
7,Audi A5,72000,19300,6
8,Audi A5,91000,12000,8
9,Mercedez Benz C class,67000,22000,6


In [2]:
dummy = pd.get_dummies(dataFrame['Car Model'], dtype=int)
dummy

Unnamed: 0,Audi A5,BMW X5,Mercedez Benz C class
0,0,1,0
1,0,1,0
2,0,1,0
3,0,1,0
4,0,1,0
5,1,0,0
6,1,0,0
7,1,0,0
8,1,0,0
9,0,0,1


# Dummy variable trap
- The dummy variable trap occurs when one or more dummy variables are redundant, meaning they can be predicted exactly from the other dummy variables. This perfect multicollinearity can cause problems for models such as linear regression.
- In this case, we can drop any one of the three columns from the dummy-variable table, since its value can always be determined from the values of the other two.
- For example, a car's model must be **Mercedez Benz C class** if the values of both **Audi A5** and **BMW X5** are 0. The same holds for the other models.
- We will drop the **Mercedez Benz C class** column. You can drop any other column instead, depending on your requirements.
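
The redundancy described above can be checked directly. The following is a small sketch on a made-up sample of models (not the full dataset): each row of the dummy table contains exactly one 1, so any column equals 1 minus the sum of the others.

```python
import pandas as pd

# Hypothetical sample reusing the category names from carprices.csv
models = pd.Series(["BMW X5", "Audi A5", "Mercedez Benz C class", "Audi A5"])
dummies = pd.get_dummies(models, dtype=int)

# Each row has exactly one 1, so the columns always sum to 1.
row_sums = dummies.sum(axis=1)
print(row_sums.tolist())

# Therefore the third column can be rebuilt from the other two.
reconstructed = 1 - dummies["Audi A5"] - dummies["BMW X5"]
is_redundant = bool((reconstructed == dummies["Mercedez Benz C class"]).all())
print(is_redundant)
```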

In [3]:
dummy = dummy.drop("Mercedez Benz C class", axis='columns')
dummy

Unnamed: 0,Audi A5,BMW X5
0,0,1
1,0,1
2,0,1
3,0,1
4,0,1
5,1,0
6,1,0
7,1,0
8,1,0
9,0,0
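
As an alternative to dropping a column by hand, `pd.get_dummies` accepts a `drop_first=True` argument. Note that it removes the first category in sorted order (here **Audi A5**), not the **Mercedez Benz C class** column dropped above, so the resulting columns differ from the table shown. The `models` series below is a made-up sample for illustration.

```python
import pandas as pd

# Hypothetical sample reusing the category names from carprices.csv
models = pd.Series(
    ["BMW X5", "Audi A5", "Mercedez Benz C class"], name="Car Model"
)

# drop_first=True drops the first level (alphabetically "Audi A5"),
# leaving k-1 dummy columns for k categories.
reduced = pd.get_dummies(models, drop_first=True, dtype=int)
print(reduced.columns.tolist())
```

Either choice removes the redundancy. Which column is dropped only changes which category acts as the baseline.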


In [4]:
# we also have to drop the 'Car Model' column from the original dataframe
dataFrame = dataFrame.drop("Car Model", axis = 'columns')
dataFrame = pd.concat([dataFrame, dummy], axis = 'columns')
dataFrame

Unnamed: 0,Mileage,Sell Price($),Age(yrs),Audi A5,BMW X5
0,69000,18000,6,0,1
1,35000,34000,3,0,1
2,57000,26100,5,0,1
3,22500,40000,2,0,1
4,46000,31500,4,0,1
5,59000,29400,5,1,0
6,52000,32000,5,1,0
7,72000,19300,6,1,0
8,91000,12000,8,1,0
9,67000,22000,6,0,0
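
The drop-and-concat steps above can also be done in a single call by passing the whole DataFrame to `pd.get_dummies` with the `columns` argument. This is a sketch on three rows copied from the table above rather than the full CSV. Note that the new columns get a `Car Model_` prefix by default, unlike the column names in the notebook output.

```python
import pandas as pd

# Three rows taken from the dataset shown above (rows 0, 5 and 9)
df = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class"],
    "Mileage": [69000, 59000, 67000],
    "Sell Price($)": [18000, 29400, 22000],
    "Age(yrs)": [6, 5, 6],
})

# Encode 'Car Model' in place: the original column is removed and
# the dummy columns are appended after the remaining columns.
encoded = pd.get_dummies(df, columns=["Car Model"], dtype=int)

# Drop one dummy column to avoid the dummy variable trap.
encoded = encoded.drop("Car Model_Mercedez Benz C class", axis="columns")
print(encoded.columns.tolist())
```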


The dataset now contains only numeric columns, so it can be used as input to a machine learning model.
***