## Encoding categorical features

1. To convert categorical features to integer codes, we can use the **OrdinalEncoder**. This estimator transforms each categorical feature into one new feature of integers (0 to n_categories - 1).

     - *Suited to **ordinal data**, whose categories can be ranked or ordered. The codes follow that order when it is passed through the `categories` parameter; by default the categories are sorted alphabetically. For example, education levels (high school, bachelor's, master's, Ph.D.) or temperature categories (cold, warm, hot).*

2. Another way to make categorical features usable with scikit-learn estimators is one-of-K encoding, also known as one-hot or dummy encoding. It is provided by the **OneHotEncoder**, which transforms each categorical feature with n_categories possible values into n_categories binary features, exactly one of which is 1 while all others are 0.
     - *Records the presence or absence of each category, which suits **nominal data**, whose categories have no intrinsic order or ranking. For example, colors (red, blue, green), types of animals (mammal, fish, reptile, amphibian, or bird), brand names (Coca-Cola, Pepsi, Sprite), or pizza toppings (pepperoni, mushrooms, onions).*
     
     
3. **LabelEncoder** encodes target labels with values between 0 and n_classes - 1. It is meant for the target `y`, not for the input features `X`.

- [OrdinalEncoder reference](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
- [OneHotEncoder reference](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)
- [LabelEncoder reference](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html)
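As a quick side-by-side of the three encoders described above, here is a minimal sketch on made-up toy data (the temperature values and the `yes`/`no` target are illustrative assumptions, not part of the datasets used below). The one-hot result is densified with `.toarray()` because the encoder returns a sparse matrix by default.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, LabelEncoder

# Toy data (hypothetical, for illustration only)
X = np.array([['cold'], ['hot'], ['warm'], ['cold']], dtype=object)
y = ['no', 'yes', 'yes', 'no']

# Ordinal: one integer column, with the order given explicitly
ordinal = OrdinalEncoder(categories=[['cold', 'warm', 'hot']])
X_ord = ordinal.fit_transform(X)

# One-hot: one binary column per category (sparse by default, so densify)
onehot = OneHotEncoder()
X_ohe = onehot.fit_transform(X).toarray()

# Label: intended for the target only
y_enc = LabelEncoder().fit_transform(y)

print(X_ord.ravel())  # [0. 2. 1. 0.]
print(X_ohe.shape)    # (4, 3)
print(y_enc)          # [0 1 1 0]
```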

In [1]:
# import libraries
import pandas as pd
import numpy as np

In [4]:
# import dataset
df = pd.read_csv('customer.csv')
df = df.iloc[:,2:]
df.sample(5)

Unnamed: 0,review,education,purchased
18,Good,School,No
38,Good,School,No
1,Poor,UG,No
45,Poor,PG,Yes
34,Average,School,No


In [6]:
# Train Test Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df[['review', 'education']],
                                                    df['purchased'],
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((35, 2), (15, 2))

In [7]:
X_train.head()

Unnamed: 0,review,education
7,Poor,School
14,Poor,PG
45,Poor,PG
48,Good,UG
29,Average,UG


### OrdinalEncoder

In [10]:
# import encoder
from sklearn.preprocessing import OrdinalEncoder

# initialize encoder
oe = OrdinalEncoder(categories=[['Poor','Average','Good'],['School','UG','PG']])

# fit and transform
X_train_enc = oe.fit_transform(X_train)
X_test_enc = oe.transform(X_test)

X_train_enc[:5]

array([[0., 0.],
       [0., 2.],
       [0., 2.],
       [2., 1.],
       [1., 1.]])

*We can check that the encoded values follow the order given for each ordinal feature.*

In [11]:
# check categories
oe.categories_

[array(['Poor', 'Average', 'Good'], dtype=object),
 array(['School', 'UG', 'PG'], dtype=object)]
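Because the category order is stored on the fitted encoder, the integer codes can be mapped back to the original labels. The sketch below uses a tiny hand-made array (an assumption for illustration) with the same category lists as above. It also shows that, with the default settings, a value outside the declared categories raises a `ValueError` at transform time.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Same category order as above; the two rows are toy values for illustration
order = [['Poor', 'Average', 'Good'], ['School', 'UG', 'PG']]
toy = np.array([['Poor', 'PG'], ['Good', 'UG']], dtype=object)

enc = OrdinalEncoder(categories=order).fit(toy)
codes = enc.transform(toy)               # [[0., 2.], [2., 1.]]
restored = enc.inverse_transform(codes)  # back to the original strings

# A category that was never declared is rejected by default
unseen = np.array([['Excellent', 'PG']], dtype=object)
try:
    enc.transform(unseen)
    raised = False
except ValueError:
    raised = True

print(codes)
print(restored)
print(raised)  # True
```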

### Label Encoding

In [21]:
from sklearn.preprocessing import LabelEncoder

# initialize encoder
le = LabelEncoder()

# fit
le.fit(y_train)

In [22]:
# check classes
le.classes_

array([0, 1], dtype=int64)

In [23]:
# encode train
y_train_trans = le.transform(y_train)
y_train_trans

array([1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1,
       1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0], dtype=int64)
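For reference, `classes_` holds the original label values in sorted order, and their positions are the integer codes. A minimal sketch with hand-written string labels (a hypothetical list, not the `y_train` used above) makes the mapping and its inverse explicit:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical string labels, for illustration only
labels = ['No', 'Yes', 'No', 'No', 'Yes']

enc = LabelEncoder().fit(labels)
encoded = enc.transform(labels)           # position of each label in classes_
decoded = enc.inverse_transform(encoded)  # back to the original strings

print(enc.classes_)  # ['No' 'Yes']
print(encoded)       # [0 1 0 0 1]
```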

## OneHotEncoding for Nominal Categorical Variables

In [32]:
# load dataset
df = pd.read_csv('cars.csv')
df.head(5)

Unnamed: 0,brand,km_driven,fuel,owner,selling_price
0,Maruti,145500,Diesel,First Owner,450000
1,Skoda,120000,Diesel,Second Owner,370000
2,Honda,140000,Petrol,Third Owner,158000
3,Hyundai,127000,Diesel,First Owner,225000
4,Maruti,120000,Petrol,First Owner,130000


In [26]:
# check counts
df['owner'].value_counts()

First Owner             5289
Second Owner            2105
Third Owner              555
Fourth & Above Owner     174
Test Drive Car             5
Name: owner, dtype: int64

In [31]:
df['fuel'].value_counts()

Diesel    4402
Petrol    3631
CNG         57
LPG         38
Name: fuel, dtype: int64

### OneHotEncoding using Pandas

In [30]:
pd.get_dummies(df,columns=['fuel','owner'])

Unnamed: 0,brand,km_driven,selling_price,fuel_CNG,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_First Owner,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,0,1,0,0,1,0,0,0,0
1,Skoda,120000,370000,0,1,0,0,0,0,1,0,0
2,Honda,140000,158000,0,0,0,1,0,0,0,0,1
3,Hyundai,127000,225000,0,1,0,0,1,0,0,0,0
4,Maruti,120000,130000,0,0,0,1,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,0,1,1,0,0,0,0
8124,Hyundai,119000,135000,0,1,0,0,0,1,0,0,0
8125,Maruti,120000,382000,0,1,0,0,1,0,0,0,0
8126,Tata,25000,290000,0,1,0,0,1,0,0,0,0


**A separate column is created for each category.**

### K-1 OneHotEncoding with pandas

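To remove the first dummy variable of each feature, we can use the `drop_first` parameter of `pd.get_dummies` (the scikit-learn `OneHotEncoder` equivalent is `drop='first'`).

Dropping one column per feature avoids multicollinearity: the K dummy columns of a feature always sum to 1, so any one of them is fully determined by the other K - 1.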
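The redundancy can be seen on a small toy frame (hypothetical values, for illustration): with all K columns every row sums to 1, and with `drop_first=True` the alphabetically first category becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with three fuel types (illustrative values only)
toy = pd.DataFrame({'fuel': ['Diesel', 'Petrol', 'CNG', 'Diesel']})

# K columns: each row has exactly one 1, so the columns are linearly dependent
full = pd.get_dummies(toy, columns=['fuel'], dtype=int)

# K-1 columns: the first category (CNG) is encoded as all zeros
reduced = pd.get_dummies(toy, columns=['fuel'], drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())  # [1, 1, 1, 1]
print(list(reduced.columns))      # ['fuel_Diesel', 'fuel_Petrol']
```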

In [33]:
pd.get_dummies(df,columns=['fuel','owner'],drop_first=True)

Unnamed: 0,brand,km_driven,selling_price,fuel_Diesel,fuel_LPG,fuel_Petrol,owner_Fourth & Above Owner,owner_Second Owner,owner_Test Drive Car,owner_Third Owner
0,Maruti,145500,450000,1,0,0,0,0,0,0
1,Skoda,120000,370000,1,0,0,0,1,0,0
2,Honda,140000,158000,0,0,1,0,0,0,1
3,Hyundai,127000,225000,1,0,0,0,0,0,0
4,Maruti,120000,130000,0,0,1,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...
8123,Hyundai,110000,320000,0,0,1,0,0,0,0
8124,Hyundai,119000,135000,1,0,0,1,0,0,0
8125,Maruti,120000,382000,1,0,0,0,0,0,0
8126,Tata,25000,290000,1,0,0,0,0,0,0


### OneHotEncoding using Sklearn

In [37]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(df.iloc[:,0:4],df.iloc[:,-1],test_size=0.2,random_state=2)

X_train.head()

Unnamed: 0,brand,km_driven,fuel,owner
5571,Hyundai,35000,Diesel,First Owner
2038,Jeep,60000,Diesel,First Owner
2957,Hyundai,25000,Petrol,First Owner
7618,Mahindra,130000,Diesel,Second Owner
6684,Hyundai,155000,Diesel,First Owner


In [40]:
from sklearn.preprocessing import OneHotEncoder
# scikit-learn >= 1.2 names this parameter `sparse_output`; older releases call it `sparse`
ohe = OneHotEncoder(drop='first',sparse_output=False,dtype=np.int32)
X_train_new = ohe.fit_transform(X_train[['fuel','owner']])
X_test_new = ohe.transform(X_test[['fuel','owner']])
X_train_new



array([[1, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       ...,
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [41]:
# horizontally stack the untouched columns with the one-hot encoded columns
np.hstack((X_train[['brand','km_driven']].values,X_train_new))

array([['Hyundai', 35000, 1, ..., 0, 0, 0],
       ['Jeep', 60000, 1, ..., 0, 0, 0],
       ['Hyundai', 25000, 0, ..., 0, 0, 0],
       ...,
       ['Tata', 15000, 0, ..., 0, 0, 0],
       ['Maruti', 32500, 1, ..., 1, 0, 0],
       ['Isuzu', 121000, 1, ..., 0, 0, 0]], dtype=object)

### OneHotEncoding with Top Categories

In [46]:
# value counts of brand
counts = df['brand'].value_counts()

In [47]:
threshold = 100
df['brand'].nunique()

32

In [49]:
# select brands that appear at most `threshold` (100) times
repl = counts[counts <= threshold].index
repl

Index(['Nissan', 'Jaguar', 'Volvo', 'Datsun', 'Mercedes-Benz', 'Fiat', 'Audi',
       'Lexus', 'Jeep', 'Mitsubishi', 'Force', 'Land', 'Isuzu', 'Kia',
       'Ambassador', 'Daewoo', 'MG', 'Ashok', 'Opel', 'Peugeot'],
      dtype='object')

In [53]:
pd.get_dummies(df['brand'].replace(repl, 'uncommon')).sample(5)

Unnamed: 0,BMW,Chevrolet,Ford,Honda,Hyundai,Mahindra,Maruti,Renault,Skoda,Tata,Toyota,Volkswagen,uncommon
5069,0,0,0,0,0,0,0,0,0,0,0,1,0
5304,0,1,0,0,0,0,0,0,0,0,0,0,0
5580,0,0,0,0,1,0,0,0,0,0,0,0,0
3266,0,0,0,0,1,0,0,0,0,0,0,0,0
5317,0,0,0,0,1,0,0,0,0,0,0,0,0
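The grouping steps above can be collected into a small helper. This is a sketch under assumptions: `group_rare` is a hypothetical name, and the toy series stands in for `df['brand']`. It uses the same `<=` comparison as the cells above.

```python
import pandas as pd

def group_rare(series, threshold, other='uncommon'):
    """Replace categories that occur at most `threshold` times with `other`."""
    counts = series.value_counts()
    rare = counts[counts <= threshold].index
    # Keep values that are not rare; everything else becomes `other`
    return series.where(~series.isin(rare), other)

# Toy brand column (illustrative values only)
brands = pd.Series(['Maruti', 'Maruti', 'Maruti', 'Honda', 'Honda', 'Jeep', 'Kia'])

grouped = group_rare(brands, threshold=1)
dummies = pd.get_dummies(grouped, dtype=int)

print(list(dummies.columns))  # ['Honda', 'Maruti', 'uncommon']
```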
