# Categorical feature

Categorical data is data whose values fall into a limited set of groups or categories, each identified by a name or label.

### Categorical Variables
- Nominal (unordered categories), e.g. Male / Female
- Ordinal (ordered categories), e.g. Bad / Neutral / Good
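
The difference between the two kinds can be made explicit in pandas. The sketch below uses hypothetical toy values (not taken from any dataset in this notebook) and declares the ordinal order by hand.

```python
import pandas as pd

# Hypothetical toy values, used only for illustration
ratings = pd.Series(["Good", "Bad", "Neutral", "Good"])

# Nominal: the categories have no inherent order
nominal = pd.Categorical(["Male", "Female", "Female"])

# Ordinal: the order is declared explicitly, so min/max and comparisons make sense
ordinal = pd.Categorical(
    ratings, categories=["Bad", "Neutral", "Good"], ordered=True
)

print(nominal.ordered)               # False
print(ordinal.codes.tolist())        # [2, 0, 1, 2]
print(ordinal.min(), ordinal.max())  # Bad Good
```

The integer codes of the ordered categorical follow the declared order, which is what an ordinal encoding relies on.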

# Types of Encoding  

- One-hot / Dummy Encoding 
- Label / Ordinal Encoding 
- Target Encoding
- Frequency / Count Encoding
- Binary Encoding
- Feature Hashing
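
As a rough illustration of three of these encodings, here is a minimal pandas sketch. The data frame `df` and its values are hypothetical, and the column names `fuel_label`, `fuel_count` and `fuel_target` are made up for this example.

```python
import pandas as pd

# Hypothetical toy data, loosely modelled on a used-car table
df = pd.DataFrame({
    "fuel": ["Diesel", "Petrol", "Diesel", "CNG", "Petrol", "Diesel"],
    "selling_price": [450, 150, 350, 100, 250, 400],
})

# Label / ordinal encoding: each category becomes an integer code
# (pandas assigns codes in sorted order: CNG=0, Diesel=1, Petrol=2)
df["fuel_label"] = df["fuel"].astype("category").cat.codes

# Frequency / count encoding: each category becomes its number of occurrences
df["fuel_count"] = df["fuel"].map(df["fuel"].value_counts())

# Target encoding: each category becomes the mean target value of its rows
df["fuel_target"] = df["fuel"].map(df.groupby("fuel")["selling_price"].mean())

print(df)
```

In practice the target means should be computed from the training rows only, otherwise information about the test targets leaks into the features.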

# One-hot / Dummy Encoding

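### Dummy variable trap
- One-hot encoding a feature with n categories creates n columns, and exactly one of them is 1 in every row, so any one column can be derived from the other n-1.
- Keeping all n columns therefore makes the features redundant. In linear models this causes multicollinearity, which is known as the dummy variable trap.
- To avoid it, drop one column (for example the first) and work with the remaining n-1 columns.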
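
The redundancy can be checked numerically. The following sketch uses a hypothetical toy `fuel` column and adds an intercept column of ones, as a linear model would; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy feature with n = 3 categories
fuel = pd.Series(["Diesel", "Petrol", "CNG", "Petrol", "Diesel"], name="fuel")

full = pd.get_dummies(fuel, dtype=int)                      # n columns
reduced = pd.get_dummies(fuel, drop_first=True, dtype=int)  # n-1 columns

# Every row of the full encoding sums to 1
row_sums = full.sum(axis=1)

# Design matrices with an intercept column of ones
X_full = np.column_stack([np.ones(len(fuel)), full.to_numpy()])
X_reduced = np.column_stack([np.ones(len(fuel)), reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(X_full)        # 3, although X_full has 4 columns
rank_reduced = np.linalg.matrix_rank(X_reduced)  # 3, equal to its 3 columns

print(X_full.shape[1], rank_full)
print(X_reduced.shape[1], rank_reduced)
```

The full design matrix is rank deficient because the dummy columns add up to the intercept column, while the reduced one has full column rank.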

# Implementation  

In [4]:
import numpy as np 
import pandas as pd 


In [5]:
data = pd.read_csv('cars.csv')

In [6]:
data

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000
...,...,...,...,...,...
8123,Hyundai,110000,Petrol,First Owner,320000
8124,Hyundai,119000,Diesel,Fourth & Above Owner,135000
8125,Maruti,120000,Diesel,First Owner,382000
8126,Tata,25000,Diesel,First Owner,290000


In [7]:
data['brand'].value_counts()

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [8]:
data['brand'].nunique()

32

In [10]:
data['fuel'].nunique()

4

# 1. OneHotEncoding using Pandas 

In [12]:
pd.get_dummies(data, columns=["fuel","owner"])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


# 2. k-1 OneHotEncoding  

In [13]:
pd.get_dummies(data, columns=["fuel", "owner"], drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


# 3. OneHotEncoding using Sklearn 

In [37]:
from sklearn.model_selection import train_test_split 
x_train, x_test , y_train, y_test = train_test_split(data.iloc[:,0:4],data.iloc[:,-1], test_size=0.2)

In [38]:
x_train

Unnamed: 0,brand,km_driven,fuel,owner
5793,Hyundai,110000,CNG,Third Owner
7289,Chevrolet,74321,Diesel,First Owner
2198,Ford,13500,Petrol,First Owner
3599,Datsun,10000,Petrol,First Owner
3004,Maruti,100000,Diesel,First Owner
...,...,...,...,...
2992,Nissan,100000,Petrol,First Owner
6517,Hyundai,50000,Diesel,Second Owner
4709,Mahindra,110000,Diesel,Second Owner
3603,Hyundai,35000,Diesel,Second Owner


In [20]:
y_train

2575     651000
944     1000000
5475     430000
4932     421000
143      795000
         ...   
7916     600000
3218     350000
4653     850000
3296    1000000
6197     325000
Name: selling_price, Length: 6502, dtype: int64

In [21]:
from sklearn.preprocessing import OneHotEncoder 

In [55]:
ohe = OneHotEncoder(drop="first")  # drop the first category of each feature to avoid multicollinearity

In [56]:
x_train_new = ohe.fit_transform(x_train[["fuel","owner"]]).toarray()

In [57]:
x_test_new = ohe.transform(x_test[["fuel", "owner"]]).toarray()  # reuse the encoder fitted on x_train; never refit on test data
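
The encoder should learn its categories from the training data only and then be reused on the test data. A small sketch of why this matters, using a hypothetical toy split in which the test set happens to contain no `CNG` rows:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy split, used only for illustration
train = pd.DataFrame({"fuel": ["CNG", "Diesel", "Petrol", "Diesel"]})
test = pd.DataFrame({"fuel": ["Petrol", "Diesel"]})

enc = OneHotEncoder(drop="first")
train_enc = enc.fit_transform(train).toarray()  # learns CNG, Diesel, Petrol
test_enc = enc.transform(test).toarray()        # reuses the learned categories

# Refitting on the test set alone learns only Diesel and Petrol,
# so the encoded test matrix ends up with a different number of columns
refit_enc = OneHotEncoder(drop="first").fit_transform(test).toarray()

print(train_enc.shape, test_enc.shape, refit_enc.shape)  # (4, 2) (2, 2) (2, 1)
```

With `transform`, train and test matrices share the same columns in the same order. With a second `fit_transform`, the column count and meaning can silently change.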

In [58]:
x_train[["brand", "km_driven"]].values

array([['Hyundai', 110000],
       ['Chevrolet', 74321],
       ['Ford', 13500],
       ...,
       ['Mahindra', 110000],
       ['Hyundai', 35000],
       ['Maruti', 90000]], dtype=object)

In [59]:
np.hstack((x_train[["brand", "km_driven"]].values, x_train_new))

array([['Hyundai', 110000, 0.0, ..., 0.0, 0.0, 1.0],
       ['Chevrolet', 74321, 1.0, ..., 0.0, 0.0, 0.0],
       ['Ford', 13500, 0.0, ..., 0.0, 0.0, 0.0],
       ...,
       ['Mahindra', 110000, 1.0, ..., 1.0, 0.0, 0.0],
       ['Hyundai', 35000, 1.0, ..., 1.0, 0.0, 0.0],
       ['Maruti', 90000, 1.0, ..., 0.0, 0.0, 0.0]], dtype=object)