## Label Encoding and Ordinal Encoding

1. Use an ordinal encoder when the input columns (X) hold ordinal categorical data
2. Use a label encoder when the output variable (y) is categorical

### If the input is categorical & nominal : one-hot encoding
### If the input is categorical & ordinal : ordinal encoding

## Example
### Education
1. High School
2. UG
3. PG

The levels above are categorical and ordinal, hence ordinal-encode: HS = 1, UG = 2, PG = 3
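The mapping HS = 1, UG = 2, PG = 3 can be sketched by hand with a plain pandas `map`. The toy series and the `order` dictionary are illustrative assumptions, not part of `customer.csv`.

```python
import pandas as pd

# Hypothetical toy column; the values mirror the education levels listed above.
education = pd.Series(["UG", "High School", "PG", "UG"])

# Explicit rank for each level: High School < UG < PG.
order = {"High School": 1, "UG": 2, "PG": 3}
encoded = education.map(order)

print(encoded.tolist())  # [2, 1, 3, 2]
```

Any level missing from the dictionary would come out as NaN, which is one reason to prefer a fitted encoder on real data.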

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('customer.csv')
df.sample(5)

Unnamed: 0,age,gender,review,education,purchased
24,16,Female,Average,PG,Yes
18,19,Male,Good,School,No
4,16,Female,Average,UG,No
48,39,Female,Good,UG,Yes
45,61,Male,Poor,PG,Yes


1. age - Numerical
2. gender - Categorical-nominal ::: One Hot Encoder
3. review, education - Categorical-Ordinal ::: Ordinal Encoder
4. purchased - Categorical-nominal (target) ::: Label Encoder

## Ordinal Encoding

In [2]:
df = df.iloc[:,2:]
df.head()

Unnamed: 0,review,education,purchased
0,Average,School,No
1,Poor,UG,No
2,Good,PG,No
3,Good,PG,No
4,Average,UG,No


In [3]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,:2],df.iloc[:,2],test_size=0.2)
X_train.head()

Unnamed: 0,review,education
28,Poor,School
23,Good,School
1,Poor,UG
40,Good,School
46,Poor,PG


In [4]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

# Categories are passed to define the order Poor < Average < Good; without them the levels are sorted alphabetically
oe.fit(X_train)

In [5]:
X_train = oe.transform(X_train)
X_test = oe.transform(X_test)

X_train

array([[0., 0.],
       [2., 0.],
       [0., 1.],
       [2., 0.],
       [0., 2.],
       [1., 1.],
       [1., 0.],
       [0., 0.],
       [2., 2.],
       [0., 0.],
       [0., 2.],
       [0., 1.],
       [1., 1.],
       [2., 2.],
       [0., 1.],
       [0., 0.],
       [1., 2.],
       [2., 0.],
       [1., 0.],
       [0., 2.],
       [0., 0.],
       [2., 1.],
       [0., 2.],
       [2., 1.],
       [2., 0.],
       [2., 1.],
       [2., 1.],
       [2., 2.],
       [2., 2.],
       [1., 1.],
       [2., 1.],
       [0., 2.],
       [1., 2.],
       [2., 2.],
       [1., 0.],
       [1., 1.],
       [2., 0.],
       [1., 0.],
       [2., 2.],
       [0., 2.]])

In [6]:
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
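A small sketch of why the `categories` argument matters: with no order given, `OrdinalEncoder` sorts the levels, so the codes no longer follow Poor < Average < Good. The toy frame and the names `default_oe` and `ordered_oe` are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frame with the same review labels as customer.csv.
toy = pd.DataFrame({"review": ["Poor", "Average", "Good", "Poor"]})

# No categories given: levels are sorted, so Average=0, Good=1, Poor=2.
default_oe = OrdinalEncoder().fit(toy)
print(default_oe.categories_[0].tolist())   # ['Average', 'Good', 'Poor']

# Explicit order: Poor=0, Average=1, Good=2.
ordered_oe = OrdinalEncoder(categories=[["Poor", "Average", "Good"]]).fit(toy)
codes = ordered_oe.transform(toy)
print(codes.ravel().tolist())               # [0.0, 1.0, 2.0, 0.0]

# inverse_transform maps the codes back to the original labels.
restored = ordered_oe.inverse_transform(codes)
print(restored.ravel().tolist())            # ['Poor', 'Average', 'Good', 'Poor']
```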

## Label Encoding

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # classes are sorted, so the codes follow alphabetical order (No = 0, Yes = 1)
le.fit(y_train)

In [8]:
y_train = le.transform(y_train)
y_test = le.transform(y_test)

In [9]:
y_train

array([0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1,
       0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0])
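To check which label got which code, the fitted encoder exposes `classes_`. A minimal sketch on a made-up target array (the names `y` and `le_demo` are assumptions):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target values matching the 'purchased' column.
y = np.array(["Yes", "No", "No", "Yes"])

le_demo = LabelEncoder().fit(y)
print(le_demo.classes_.tolist())   # ['No', 'Yes'], so No=0 and Yes=1

encoded = le_demo.transform(y)
print(encoded.tolist())            # [1, 0, 0, 1]

# inverse_transform recovers the original labels from the codes.
decoded = le_demo.inverse_transform(encoded)
print(decoded.tolist())            # ['Yes', 'No', 'No', 'Yes']
```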

# One Hot Encoding

1. Attribute: Color, with values yellow, red, blue
2. The one column becomes three: color -> Yellow, Red, Blue
3. Each row is then marked with a single 1: yellow = 1 0 0, red = 0 1 0, blue = 0 0 1

4. However, this gives a problem: the new columns in every row always sum to 1, so any one of them can be derived from the others, which causes issues in linear models

5. This problem is MULTICOLLINEARITY

6. Hence we keep only two columns: yellow = 0 0, red = 1 0, blue = 0 1

7. If we had n categories, we use (n-1) columns
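The multicollinearity point can be checked numerically. This is a rough sketch on a made-up color column: with all three dummy columns plus an intercept column of ones, the matrix has 4 columns but only rank 3, while dropping one dummy restores full column rank. Note that pandas drops the first category alphabetically (blue here), not yellow as in the hand-written example.

```python
import numpy as np
import pandas as pd

colors = pd.Series(["yellow", "red", "blue", "red", "yellow"], name="color")

full = pd.get_dummies(colors, dtype=int)                      # 3 columns
reduced = pd.get_dummies(colors, dtype=int, drop_first=True)  # 2 columns

# Each row of the full encoding sums to 1 ...
print(full.sum(axis=1).tolist())          # [1, 1, 1, 1, 1]

# ... so with an intercept column of ones the matrix loses rank.
ones = np.ones((len(colors), 1))
X_full = np.hstack([ones, full.to_numpy()])        # 4 columns
X_reduced = np.hstack([ones, reduced.to_numpy()])  # 3 columns

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)
print(X_full.shape[1], rank_full)         # 4 3
print(X_reduced.shape[1], rank_reduced)   # 3 3
```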

In [10]:
f = pd.read_csv('cars.csv')
f.sample(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
7972,Maruti,108000,Diesel,First Owner,675000
6637,Tata,15000,Petrol,First Owner,450000
3607,Jeep,32000,Diesel,First Owner,1511000
7225,Maruti,20000,Petrol,First Owner,330000
1579,Maruti,50000,Diesel,Second Owner,425000


In [11]:
print(f"Total Brands: {f['brand'].nunique()}")
f['brand'].value_counts()

Total Brands: 32


brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [12]:
f['owner'].nunique()

5

In [13]:
f['fuel'].nunique()

4

## 1. OHE using Pandas :: Keeps All k Columns, So Multicollinearity Remains

In [14]:
# Apply OHE on Fuel and Owner
# So fuel -> new 4 Col, Owner -> new 5 Cols
pd.get_dummies(f,columns=['fuel','owner'])



Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,False,True,False,False,True,False,False,False,False
1,Skoda,120000,370000,False,True,False,False,False,False,True,False,False
2,Honda,140000,158000,False,False,False,True,False,False,False,False,True
3,Hyundai,127000,225000,False,True,False,False,True,False,False,False,False
4,Maruti,120000,130000,False,False,False,True,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,False,True,True,False,False,False,False
8124,Hyundai,119000,135000,False,True,False,False,False,True,False,False,False
8125,Maruti,120000,382000,False,True,False,False,True,False,False,False,False
8126,Tata,25000,290000,False,True,False,False,True,False,False,False,False


## K-1 Encoding :: Solves Multicollinearity

In [15]:
pd.get_dummies(f,columns=['fuel','owner'],drop_first = True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,True,False,False,False,False,False,False
1,Skoda,120000,370000,True,False,False,False,True,False,False
2,Honda,140000,158000,False,False,True,False,False,False,True
3,Hyundai,127000,225000,True,False,False,False,False,False,False
4,Maruti,120000,130000,False,False,True,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,False,False,True,False,False,False,False
8124,Hyundai,119000,135000,True,False,False,True,False,False,False
8125,Maruti,120000,382000,True,False,False,False,False,False,False
8126,Tata,25000,290000,True,False,False,False,False,False,False


## Using Sklearn
1. Pull out the columns to encode, then apply the encoder to them
2. Merge the encoded columns back with the remaining columns

In [16]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(f.iloc[:,:4],f.iloc[:,-1],test_size=0.2)
X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
2664,Lexus,20000,Petrol,First Owner
5555,Maruti,50000,Petrol,First Owner
987,Maruti,100000,Diesel,First Owner
3795,Tata,90000,Diesel,Second Owner
6543,Ford,18000,Petrol,Second Owner


In [17]:
from sklearn.preprocessing import OneHotEncoder
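# drop='first' removes one column per feature to avoid multicollinearity
# sparse_output replaced the older sparse argument in scikit-learn 1.2
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)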

X_train_new = ohe.fit_transform(X_train[['fuel','owner']])
X_test_new = ohe.transform(X_test[['fuel','owner']]) 



In [18]:
X_train_new.shape

(6502, 7)

### Merge

In [19]:
X_train[['brand','km_driven']].values

array([['Lexus', 20000],
       ['Maruti', 50000],
       ['Maruti', 100000],
       ...,
       ['Ford', 47000],
       ['Hyundai', 11000],
       ['Volkswagen', 110000]], dtype=object)

In [20]:
np.hstack((X_train[['brand','km_driven']].values, X_train_new)).shape

(6502, 9)

# OHE On Top Categories
1. Brands with 100 or fewer cars are merged into a single 'Uncommon' category

In [21]:
Counts = f['brand'].value_counts()
threshold = 100 
Counts

brand
Maruti           2448
Hyundai          1415
Mahindra          772
Tata              734
Toyota            488
Honda             467
Ford              397
Chevrolet         230
Renault           228
Volkswagen        186
BMW               120
Skoda             105
Nissan             81
Jaguar             71
Volvo              67
Datsun             65
Mercedes-Benz      54
Fiat               47
Audi               40
Lexus              34
Jeep               31
Mitsubishi         14
Force               6
Land                6
Isuzu               5
Kia                 4
Ambassador          4
Daewoo              3
MG                  3
Ashok               1
Opel                1
Peugeot             1
Name: count, dtype: int64

In [22]:
repl = Counts[Counts <= threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object', name='brand')

In [23]:
pd.get_dummies(f['brand'].replace(repl , 'Uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Uncommon,Volkswagen
903,False,False,False,False,False,False,False,False,False,False,True,False,False
5554,False,False,False,False,False,False,True,False,False,False,False,False,False
7111,False,False,False,False,False,False,True,False,False,False,False,False,False
2026,False,False,False,False,False,False,True,False,False,False,False,False,False
3026,False,False,False,True,False,False,False,False,False,False,False,False,False


# Column Transformer: The Best Way to Encode

In [24]:
df1 = pd.read_csv('covid_toy.csv')
df1.sample(5)

Unnamed: 0,age,gender,fever,cough,city,has_covid
82,24,Male,98.0,Mild,Kolkata,Yes
85,16,Female,103.0,Mild,Bangalore,Yes
29,34,Female,,Strong,Mumbai,Yes
4,65,Female,101.0,Mild,Mumbai,No
24,13,Female,100.0,Strong,Kolkata,No


Column types, in order: age = Numerical :: gender = Cat-Nom :: fever = Numerical :: cough = Cat-Ord :: city = Cat-Nom :: has_covid = Cat-Nom (target)

In [25]:
df1.isnull().sum()

age           0
gender        0
fever        10
cough         0
city          0
has_covid     0
dtype: int64

In [31]:
df1['city'].value_counts()

city
Kolkata      32
Bangalore    30
Delhi        22
Mumbai       16
Name: count, dtype: int64

In [27]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df1.iloc[:,:5],df1.iloc[:,-1],test_size=0.2)
X_train.head(1)

Unnamed: 0,age,gender,fever,cough,city
48,66,Male,99.0,Strong,Bangalore


In [28]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer  # fills missing values; the default strategy is the column mean

transformer = ColumnTransformer(transformers=[('tnf1',SimpleImputer(),['fever'])
                                              ,('tnf2',OrdinalEncoder(categories=[['Mild','Strong']]),['cough'])
                                              ,('tnf3',OneHotEncoder(sparse_output=False,dtype=np.int32,drop='first'),['gender','city'])
                                              ],remainder='passthrough')
# Each transformer is a tuple: (name, transformer, columns to apply it to)
# remainder decides what happens to the remaining columns: 'drop' or 'passthrough'
# sparse_output replaced the older sparse argument in scikit-learn 1.2

In [29]:
transformer.fit_transform(X_train).shape




(80, 7)

In [30]:
transformer.transform(X_test).shape

(20, 7)
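The 7 output columns follow the order of the `transformers` list, with the passthrough columns appended at the end. A self-contained sketch on a few made-up rows shaped like `covid_toy.csv` (the step names and the toy frame are assumptions; `sparse_threshold=0` is used here to get a dense array without relying on the version-dependent `sparse`/`sparse_output` argument):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical rows shaped like covid_toy.csv (without the target).
toy = pd.DataFrame({
    "age": [24, 16, 34, 65],
    "gender": ["Male", "Female", "Female", "Female"],
    "fever": [98.0, 103.0, np.nan, 101.0],
    "cough": ["Mild", "Mild", "Strong", "Mild"],
    "city": ["Kolkata", "Bangalore", "Mumbai", "Delhi"],
})

ct = ColumnTransformer(
    transformers=[
        ("impute_fever", SimpleImputer(), ["fever"]),
        ("encode_cough", OrdinalEncoder(categories=[["Mild", "Strong"]]), ["cough"]),
        ("encode_nominal", OneHotEncoder(drop="first"), ["gender", "city"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)
out = ct.fit_transform(toy)

# Column order: fever, cough, gender_Male, city_Delhi, city_Kolkata, city_Mumbai, age
print(out.shape)  # (4, 7)
print(out[2])     # third row: mean-imputed fever, Strong cough, Female, Mumbai, age 34
```

The column count matches the notebook: 1 (fever) + 1 (cough) + 1 (gender, k-1) + 3 (city, k-1) + 1 (age passthrough) = 7.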