# One Hot Encoding
- Used to handle nominal categorical data.
- Each unique category becomes a column in itself.
- If there are n categories, OHE produces n columns. One of them is usually dropped, leaving n-1, to avoid the dummy variable trap (the n columns are perfectly collinear because they always sum to 1).
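
To make the dummy variable trap concrete, here is a minimal sketch on a made-up four-row column (the `toy` frame is illustrative and is not part of the cars dataset). With all n dummy columns every row sums to 1, so the dropped column can always be rebuilt from the remaining n-1.

```python
import pandas as pd

# Hypothetical toy data, not the cars dataset
toy = pd.DataFrame({"fuel": ["Diesel", "Petrol", "CNG", "Petrol"]})

# n columns (one per category) versus n-1 columns (first category dropped)
full = pd.get_dummies(toy, columns=["fuel"], dtype=int)
reduced = pd.get_dummies(toy, columns=["fuel"], drop_first=True, dtype=int)

# With all n columns, each row sums to exactly 1, so the columns are linearly dependent
row_sums = full.sum(axis=1).tolist()

# The dropped column (fuel_CNG, first in sorted order) is implied by the others
recovered_cng = (1 - reduced.sum(axis=1)).tolist()

print(full.shape[1], reduced.shape[1])
print(row_sums)
print(recovered_cng == full["fuel_CNG"].tolist())
```

Since no information is lost, dropping one column only removes the redundancy.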

## Data collection

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.read_csv('../Datasets/cars.csv')

In [3]:
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
3260,Mahindra,120000,Diesel,First Owner,880000
6929,Tata,70000,Diesel,First Owner,160000
2142,Ford,50000,Diesel,First Owner,650000
6832,Hyundai,62000,Diesel,Second Owner,509999
5348,Honda,127991,Diesel,First Owner,675000


As you can see, there are 3 nominal columns: brand, fuel, owner.

Let's get the number of unique values (categories) in each of them.

In [4]:
print(df['brand'].nunique(), df['fuel'].nunique(), df['owner'].nunique())

32 4 5


So there are 32 different brands, 4 fuel types, and 5 owner categories.

What are these values?

In [5]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [6]:
df['fuel'].value_counts()

fuel
Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: count, dtype: int64

In [7]:
df['owner'].value_counts()

owner
First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: count, dtype: int64

## OHE using pandas

In [8]:
df.shape

(8128, 5)

The dataset has 5 columns.
Applying OHE to fuel and owner removes those 2 columns and adds 4 + 5 dummy columns, so we should get 5 - 1 - 1 + 4 + 5 = 12 columns.
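
The column arithmetic can be checked programmatically. The helper `expected_ohe_columns` below is a hypothetical name introduced for illustration, and the `toy` frame is made up. It is a sketch of the counting rule, not part of pandas.

```python
import pandas as pd

def expected_ohe_columns(frame, columns, drop_first=False):
    """Predict the column count of pd.get_dummies(frame, columns=columns)."""
    # Each encoded column contributes one dummy per category (one fewer with drop_first)
    new_cols = sum(frame[c].nunique() - (1 if drop_first else 0) for c in columns)
    # The original categorical columns themselves are removed
    return frame.shape[1] - len(columns) + new_cols

# Hypothetical toy frame: 3 fuel categories, 3 owner categories
toy = pd.DataFrame({
    "km_driven": [1000, 2000, 3000, 4000],
    "fuel": ["Diesel", "Petrol", "CNG", "Petrol"],
    "owner": ["First", "Second", "First", "Third"],
})

cols = ["fuel", "owner"]
predicted = expected_ohe_columns(toy, cols)
actual = pd.get_dummies(toy, columns=cols).shape[1]

predicted_k1 = expected_ohe_columns(toy, cols, drop_first=True)
actual_k1 = pd.get_dummies(toy, columns=cols, drop_first=True).shape[1]

print(predicted, actual)
print(predicted_k1, actual_k1)
```

Applied to the cars data (5 columns, 4 fuel and 5 owner categories), the same rule gives 12 columns, or 10 when one dummy per feature is dropped.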

In [9]:
df_ohe = pd.get_dummies(df, columns = ['fuel','owner'])

In [10]:
df_ohe.sample(5)

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
4383,Volkswagen,5400,1350000,False,True,False,False,False,False,False,True,False
2722,BMW,60000,750000,False,True,False,False,False,False,True,False,False
7563,Ford,13000,690000,False,False,False,True,True,False,False,False,False
2949,Honda,3100,750000,False,False,False,True,True,False,False,False,False
3696,Tata,85700,175000,False,True,False,False,False,False,True,False,False


In [11]:
df_ohe.shape

(8128, 12)

As you can see, the number of columns changed from 5 to 12.

But we talked about the dummy variable trap, so we need to drop one column each for fuel and owner. The total should then be 10 columns.

## k-1 OHE using pandas

In [12]:
df_ohe = pd.get_dummies(df, columns = ['fuel','owner'],drop_first = True)

In [13]:
df_ohe.sample(5)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
1771,Hyundai,20000,645000,False,False,True,False,False,False,False
7801,Fiat,61000,265000,True,False,False,False,True,False,False
1271,Mahindra,100000,480000,True,False,False,False,False,False,False
5206,Hyundai,120000,160000,False,False,False,False,True,False,False
7661,Toyota,3000,3200000,True,False,False,False,False,False,False


In [14]:
df_ohe.shape

(8128, 10)

## OHE using sklearn

In [15]:
from sklearn.preprocessing import OneHotEncoder

In [16]:
# drop='first' drops the first category of each encoded feature
# sparse_output=False returns a dense NumPy array instead of a sparse matrix, so .toarray() is not needed
ohe = OneHotEncoder(drop = 'first',sparse_output = False,dtype = np.int32)

In [17]:
df.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
3418,Maruti,5621,Petrol,First Owner,650000
8010,Hyundai,60000,Petrol,Second Owner,270000
7106,Mahindra,183000,Diesel,First Owner,370000
7504,Ford,104300,Diesel,First Owner,275000
6076,Maruti,7800,Petrol,First Owner,700000


In [18]:
from sklearn.model_selection import train_test_split

In [19]:
X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size = 0.2, random_state = 0)

In [20]:
X_train.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner
6796,Hyundai,40000,Petrol,First Owner
4037,Maruti,50000,Diesel,First Owner
6605,Renault,133000,Diesel,First Owner
3573,Hyundai,53000,Diesel,First Owner
654,Honda,56494,Petrol,First Owner


In [21]:
y_train.sample(5)

1264    170000
1298    535000
2610    611000
7853    950000
7909    850000
Name: selling_price, dtype: int64

In [22]:
# With the default sparse_output=True, fit_transform returns a sparse matrix and .toarray() is needed:
# X_train_new = ohe.fit_transform(X_train[['fuel','owner']]).toarray()
# Because ohe was created with sparse_output=False, the result is already a NumPy array.
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])

In [23]:
X_train_new

array([[0, 1, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 1],
       [1, 0, 0, ..., 1, 0, 0],
       ...,
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0]], dtype=int32)

In [24]:
# transform only (no fit): reuse the categories learned from X_train
X_test_new = ohe.transform(X_test[['fuel','owner']])

In [25]:
X_test_new

array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0]], dtype=int32)

In [26]:
print(X_train.shape, X_train_new.shape)
print(X_test.shape, X_test_new.shape)

(6502, 4) (6502, 7)
(1626, 4) (1626, 7)


What we have done is encode the fuel and owner columns into a NumPy array. Now we want to join this array with the remaining columns of X_train.
For that, we first convert the brand and km_driven columns to a NumPy array and then stack the two arrays together.

In [27]:
X_train[['brand','km_driven']]

Unnamed: 0,brand,km_driven
3042,Hyundai,60000
1520,Tata,150000
2611,Hyundai,110000
3544,Mahindra,28000
4138,Maruti,15000
...,...,...
4931,Tata,70000
3264,Ford,100000
1653,Hyundai,90000
2607,Volkswagen,90000


In [28]:
# converting to a NumPy array
X_train[['brand','km_driven']].values

array([['Hyundai', 60000],
       ['Tata', 150000],
       ['Hyundai', 110000],
       ...,
       ['Hyundai', 90000],
       ['Volkswagen', 90000],
       ['Hyundai', 110000]], dtype=object)

In [29]:
# Horizontally stack 'brand', 'km_driven', the encoded fuel columns and the encoded owner columns
# X_train had 4 columns; after encoding we expect brand + km_driven + 3 fuel + 4 owner = 9 columns
df_new = np.hstack((X_train[['brand','km_driven']].values,X_train_new))

In [30]:
df_new

array([['Hyundai', 60000, 0, ..., 0, 0, 0],
       ['Tata', 150000, 1, ..., 0, 0, 1],
       ['Hyundai', 110000, 1, ..., 1, 0, 0],
       ...,
       ['Hyundai', 90000, 0, ..., 1, 0, 0],
       ['Volkswagen', 90000, 1, ..., 0, 0, 0],
       ['Hyundai', 110000, 0, ..., 0, 0, 0]], dtype=object)

In [31]:
df_new.shape

(6502, 9)

## OHE on the 'brand' column, which has many categories
- We pick a threshold, say 100 cars. Any brand with 100 cars or fewer is grouped into a single category, 'uncommon'.

In [32]:
df['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [33]:
counts = df['brand'].value_counts()

In [34]:
df['brand'].nunique()

32

In [35]:
threshold = 100

In [36]:
# gets all the brand names with at most `threshold` (100) cars
counts[counts <= threshold].index

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [37]:
repl = counts[counts <= threshold].index

In [38]:
dummy_df = pd.get_dummies(df['brand'].replace(repl,'uncommon'))

In [39]:
dummy_df.astype(int)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
0,0,0,0,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,1,0,0,0,0
2,0,0,0,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,1,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
8123,0,0,0,0,1,0,0,0,0,0,0,0,0
8124,0,0,0,0,1,0,0,0,0,0,0,0,0
8125,0,0,0,0,0,0,1,0,0,0,0,0,0
8126,0,0,0,0,0,0,0,0,0,1,0,0,0
