# Feature encoding
Feature encoding is the process of transforming categorical variables into numeric variables. This is necessary because most machine learning algorithms operate on numeric features and cannot consume raw category labels directly. There are several methods for encoding categorical features, each with its own advantages and disadvantages. In this notebook, we will explore the most popular ones:
* One-hot encoding
* Ordinal encoding
* Count encoding
* Target encoding
* CatBoost encoding
* Feature hashing
* Binary encoding
* Label encoding

# Types of encoding and when to use them
For more information about the types of encoding and when to use each of them, please refer to this [article](https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02).

# 1. One-hot encoding
One-hot encoding is the process of creating dummy variables. Each category of the categorical variable gets its own binary (0/1) dummy variable, so a variable with 3 categories yields 3 dummy variables, exactly one of which is 1 in any given row. One-hot encoding is also known as dummy variable encoding. Because the dummy variables are binary and impose no order on the categories, they are machine learning algorithm friendly. However, one-hot encoding can lead to the dummy variable trap if not handled appropriately. The trap occurs because the full set of n dummy variables is redundant: any one of them can be derived from the other n-1. This redundancy matters mainly for linear models fitted with an intercept. It can be avoided by keeping only n-1 dummy variables for n categories, so 3 categories need only 2 dummy variables. In this notebook, we will use the pandas get_dummies() function to create dummy variables. By default, get_dummies() creates n dummy variables for n categories. Its drop_first parameter defaults to False; setting it to True drops the first dummy variable and so avoids the dummy variable trap.
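
The difference between n and n-1 dummy variables can be sketched as follows. This is a minimal illustration on a hypothetical toy column named `size`, not part of the notebook's data; `dtype=int` is passed only so that the indicators are stored as 0/1 rather than booleans.

```python
import pandas as pd

# Toy data: one nominal column with three categories (illustrative only).
sizes = pd.DataFrame({"size": ["low", "medium", "high", "medium"]})

# Full one-hot encoding: one indicator column per category (3 columns).
full = pd.get_dummies(sizes, columns=["size"], dtype=int)

# Reduced encoding: drop the first category, leaving n-1 = 2 columns.
reduced = pd.get_dummies(sizes, columns=["size"], drop_first=True, dtype=int)

print(full.columns.tolist())     # ['size_high', 'size_low', 'size_medium']
print(reduced.columns.tolist())  # ['size_low', 'size_medium']
```

In the reduced frame, a row with 0 in both columns stands for the dropped category, `high`.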
# 2. Ordinal encoding
Ordinal encoding is the process of encoding categorical variables such that the encoding retains the ordinal nature of the variable. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. In ordinal encoding, these categories can be represented as 0, 1, and 2 or as 1, 2, and 3, so that the order low < medium < high is retained. Ordinal encoding is used when the categorical variable is ordinal in nature. In this notebook, we will use the pandas replace() function to perform ordinal encoding. replace() substitutes the values given in to_replace with value, and also accepts a dictionary that maps each old value to its new one. It is a method of pandas Series and DataFrame objects, so it needs no separate import: it is called on the column itself, for example df['column'].replace({...}).
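
As a small sketch of the low/medium/high example, the mapping below is written out explicitly so that the codes follow the natural order. The series `levels` and the dictionary `order` are hypothetical names, and `Series.map` is used here as one possible alternative to `replace()`.

```python
import pandas as pd

levels = pd.Series(["low", "high", "medium", "low"], name="level")

# Explicit mapping that encodes the natural order low < medium < high.
order = {"low": 0, "medium": 1, "high": 2}

encoded = levels.map(order)
print(encoded.tolist())  # [0, 2, 1, 0]
```

Unlike `replace()`, `map()` turns any value missing from the dictionary into NaN, which makes unmapped categories easy to spot.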
# 3. Count encoding
Count encoding is the process of encoding categorical variables with their frequency. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. If ‘low’ appears 10 times in the dataset, ‘medium’ 5 times, and ‘high’ 2 times, count encoding represents them as 10, 5, and 2. Here, the categories are replaced by their count in the dataset, so categories that occur equally often receive the same code and can no longer be told apart. Count encoding is used when the categorical variable is nominal in nature. In this notebook, we will use the pandas value_counts() function to perform count encoding. value_counts() returns a Series containing the counts of unique values. It is a method of pandas Series objects, so it needs no separate import: it is called as df['column'].value_counts().
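
A minimal sketch of count encoding on a column whose categories occur with different frequencies. The `colors` series is hypothetical toy data; the `normalize=True` variant, which yields relative frequencies instead of raw counts, is shown as an optional alternative.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red", "blue"], name="color")

# Raw counts per category: red 3, blue 2, green 1.
counts = colors.value_counts()
count_encoded = colors.map(counts)

# Relative frequencies instead of raw counts (red 3/6, blue 2/6, green 1/6).
freq_encoded = colors.map(colors.value_counts(normalize=True))

print(count_encoded.tolist())  # [3, 2, 3, 1, 3, 2]
```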
# 4. Target encoding
Target encoding is the process of replacing a categorical value with the mean of the target variable for that category. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. In target encoding, each category is replaced by the mean of the target over the rows that belong to it. Because the encoding is computed from the target, it should be learned from the training data only; computing it on the full dataset leaks target information into the features. Target encoding is used when the categorical variable is nominal in nature. In this notebook, we will use the pandas groupby() function to perform target encoding. groupby() splits the data into groups based on some criteria. It is a method of pandas DataFrame and Series objects, so it needs no separate import: it is called as df.groupby('column').
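
One way to keep target information from leaking is to learn the category means on the training rows and then apply them to new rows. The sketch below assumes a hypothetical `city` feature and `price` target; falling back to the global training mean for unseen categories is one common choice, not the only one.

```python
import pandas as pd

train = pd.DataFrame({
    "city": ["A", "A", "B", "B", "C"],
    "price": [10.0, 20.0, 30.0, 50.0, 40.0],
})
test = pd.DataFrame({"city": ["A", "C", "D"]})  # "D" never appears in train

# Learn the per-category target means from the training rows only.
category_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()

# Apply the learned means; unseen categories fall back to the global mean.
train["city_encoded"] = train["city"].map(category_means)
test["city_encoded"] = test["city"].map(category_means).fillna(global_mean)

print(test["city_encoded"].tolist())  # [15.0, 40.0, 30.0]
```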
# 5. CatBoost encoding
CatBoost encoding is a variant of target encoding: it also replaces a categorical value with a mean of the target variable, but it computes that mean in an ordered fashion. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. In CatBoost encoding, each row is encoded using only the target values of the rows of the same category that precede it in the dataset, combined with a prior (the overall target mean). A row's own target never contributes to its encoding, which reduces target leakage; the result therefore depends on the order of the rows. CatBoost encoding is used when the categorical variable is nominal in nature. In this notebook, we will use the category_encoders library to perform CatBoost encoding. The category_encoders library is used to encode categorical variables and can be installed with the command pip install category_encoders.
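
The ordered idea can be sketched in plain pandas. This is an illustration of the principle under stated assumptions, not the category_encoders implementation: the column names, the toy data, and the smoothing weight `a = 1.0` are all hypothetical choices.

```python
import pandas as pd

df_cb = pd.DataFrame({
    "city": ["A", "B", "A", "A", "B"],
    "target": [1, 0, 0, 1, 1],
})

prior = df_cb["target"].mean()  # overall target mean, used as the prior (0.6)
a = 1.0                         # weight given to the prior (assumed value)

grouped = df_cb.groupby("city")["target"]
# Sum of the target over earlier rows of the same category (current row excluded).
prev_sum = grouped.cumsum() - df_cb["target"]
# Number of earlier rows of the same category.
prev_count = grouped.cumcount()

df_cb["city_encoded"] = (prev_sum + prior * a) / (prev_count + a)
print(df_cb)
```

The first row of each category has no history, so it receives the prior; later rows blend the prior with the targets seen so far for that category.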
# 6. Feature hashing
Feature hashing is the process of applying a hash function to each category and using the hash value to decide which of a fixed number of output columns represents it. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. In feature hashing, each category is replaced by the column pattern that its hash value selects. The number of output columns is chosen in advance and does not grow with the number of categories, which makes feature hashing useful for variables with many categories; the price is that different categories can collide in the same column. Feature hashing is used when the categorical variable is nominal in nature. In this notebook, we will use the category_encoders library to perform feature hashing. The library can be installed with the command pip install category_encoders.
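
The mechanism can be sketched with the standard library alone. The helper names `hash_bucket` and `hash_encode`, the choice of MD5, and the bucket count of 4 are assumptions made for illustration; a library encoder may hash differently.

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 4) -> int:
    """Map a category to a bucket index using a stable hash (MD5 here)."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

def hash_encode(value: str, n_buckets: int = 4) -> list[int]:
    """Return a fixed-length indicator vector with a 1 in the hashed bucket."""
    vector = [0] * n_buckets
    vector[hash_bucket(value, n_buckets)] = 1
    return vector

for category in ["low", "medium", "high"]:
    print(category, hash_encode(category))
```

The same category always lands in the same bucket, and the vector length stays at `n_buckets` however many categories appear. Two different categories may share a bucket, which is the collision trade-off of this method.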
# 7. Binary encoding
Binary encoding is the process of converting a categorical variable into a binary representation. Each category is first assigned an integer, and that integer is then written in binary, with one output column per binary digit. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. If they are numbered 1, 2, and 3, their binary representations are 01, 10, and 11. The number of columns grows roughly with log2(n) for n categories rather than with n, as it does in one-hot encoding. Binary encoding is used when the categorical variable is nominal in nature. In this notebook, we will use the category_encoders library to perform binary encoding. The library can be installed with the command pip install category_encoders.
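
The two steps, integer codes followed by binary digits, can be written out by hand as a sketch. The series `levels` and the column names `level_0`, `level_1` are hypothetical, and a library encoder may number categories or size its output differently.

```python
import pandas as pd

levels = pd.Series(["low", "medium", "high", "low"], name="level")

# Step 1: give each category an integer code, starting at 1, in order of appearance.
codes = pd.factorize(levels)[0] + 1  # low=1, medium=2, high=3

# Step 2: write each code in binary, one column per bit (most significant first).
n_bits = int(codes.max()).bit_length()  # the largest code, 3, needs 2 bits
bits = {
    f"level_{i}": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
binary_df = pd.DataFrame(bits)
print(binary_df)
```

Here `low` becomes 0 1, `medium` becomes 1 0, and `high` becomes 1 1.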
# 8. Label encoding
Label encoding is the process of converting a categorical variable into a numeric variable by assigning each category an integer. For example, suppose we have a categorical variable with 3 categories: ‘low’, ‘medium’, and ‘high’. In label encoding, these categories can be replaced by 1, 2, and 3. Unlike ordinal encoding, the integers are assigned mechanically (for example alphabetically or in order of appearance), so they match a natural order only by coincidence. Label encoding is therefore safest when the categorical variable is ordinal and the assigned order has been checked, or when the numbers serve only as identifiers. In this notebook, we will use the category_encoders library to perform label encoding. The library can be installed with the command pip install category_encoders.
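
The effect of mechanical code assignment can be seen with pandas alone. This is a small sketch on a hypothetical `levels` series, comparing two common assignment rules.

```python
import pandas as pd

levels = pd.Series(["low", "medium", "high", "low"], name="level")

# Codes assigned in alphabetical order of the category names: high=0, low=1, medium=2.
alphabetical = levels.astype("category").cat.codes

# Codes assigned in order of first appearance: low=0, medium=1, high=2.
by_appearance = pd.factorize(levels)[0]

print(alphabetical.tolist())   # [1, 2, 0, 1]
print(by_appearance.tolist())  # [0, 1, 2, 0]
```

Only the second result happens to match low < medium < high, and only because the data appeared in that order.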

# 1. One-hot encoding
```python
import pandas as pd
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
```
```python
# one-hot encoding (dtype=int gives 0/1 columns; recent pandas versions default to booleans)
df = pd.get_dummies(df, columns=['Name'], dtype=int)
# print dataframe.
print(df)
```
output:
```python
   Age  Name_Tom  Name_jack  Name_krish  Name_nick
0   20         1          0           0          0
1   21         0          0           0          1
2   19         0          0           1          0
3   18         0          1           0          0
```
# 2. Ordinal encoding
```python
import pandas as pd
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
```
```python
# ordinal encoding (names have no natural order; this mapping is for illustration only)
df['Name'] = df['Name'].replace({'Tom':1, 'nick':2, 'krish':3, 'jack':4})
# print dataframe.
print(df)
```
output:
```python
   Name  Age
0     1   20
1     2   21
2     3   19
3     4   18
```
# 3. Count encoding
```python
import pandas as pd
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
4    Tom   20
5   nick   21
6  krish   19
7   jack   18
```
```python
# count encoding
df['Name'] = df['Name'].map(df['Name'].value_counts())
# print dataframe.
print(df)
```
output:
```python
   Name  Age
0     2   20
1     2   21
2     2   19
3     2   18
4     2   20
5     2   21
6     2   19
7     2   18
```
# 4. Target encoding
```python
import pandas as pd
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18], 'Marks':[80, 90, 70, 60, 80, 90, 70, 60]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age  Marks
0    Tom   20     80
1   nick   21     90
2  krish   19     70
3   jack   18     60
4    Tom   20     80
5   nick   21     90
6  krish   19     70
7   jack   18     60
```
```python
# target encoding
df['Name'] = df.groupby('Name')['Marks'].transform('mean')
# print dataframe.
print(df)
```
output:
```python
   Name  Age  Marks
0  80.0   20     80
1  90.0   21     90
2  70.0   19     70
3  60.0   18     60
4  80.0   20     80
5  90.0   21     90
6  70.0   19     70
7  60.0   18     60
```
# 5. CatBoost encoding
```python
import pandas as pd
import category_encoders as ce
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
4    Tom   20
5   nick   21
6  krish   19
7   jack   18
```
```python
# CatBoost encoding (Age is used as the target here purely for illustration)
encoder = ce.CatBoostEncoder(cols=['Name'])
df = encoder.fit_transform(df, df['Age'])
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0  19.50   20
1  19.50   21
2  19.50   19
3  19.50   18
4  19.75   20
5  20.25   21
6  19.25   19
7  18.75   18
```
# 6. Feature hashing
```python
import pandas as pd
import category_encoders as ce
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
4    Tom   20
5   nick   21
6  krish   19
7   jack   18
```
```python
# feature hashing (no target is needed, so only the features are passed)
encoder = ce.HashingEncoder(cols=['Name'], n_components=4)
df = encoder.fit_transform(df)
# print dataframe.
print(df)
```
output:
```python
   col_0  col_1  col_2  col_3  Age
0      0      0      0      1   20
1      0      0      0      1   21
2      0      0      0      1   19
3      0      0      0      1   18
4      0      0      0      1   20
5      0      0      0      1   21
6      0      0      0      1   19
7      0      0      0      1   18
```
# 7. Binary encoding
```python
import pandas as pd
import category_encoders as ce
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
4    Tom   20
5   nick   21
6  krish   19
7   jack   18
```
```python
# binary encoding (no target is needed, so only the features are passed)
encoder = ce.BinaryEncoder(cols=['Name'])
df = encoder.fit_transform(df)
# print dataframe.
print(df)
```
output:
```python
   Name_0  Name_1  Name_2  Age
0       0       0       1   20
1       0       1       0   21
2       0       1       1   19
3       1       0       0   18
4       0       0       1   20
5       0       1       0   21
6       0       1       1   19
7       1       0       0   18
```
# 8. Label encoding
```python
import pandas as pd
import category_encoders as ce
# create a dataframe
data = {'Name':['Tom', 'nick', 'krish', 'jack', 'Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18, 20, 21, 19, 18]}
df = pd.DataFrame(data)
# print dataframe.
print(df)
```
output:
```python
    Name  Age
0    Tom   20
1   nick   21
2  krish   19
3   jack   18
4    Tom   20
5   nick   21
6  krish   19
7   jack   18
```
```python
# label encoding (category_encoders provides it as OrdinalEncoder; no target is needed)
encoder = ce.OrdinalEncoder(cols=['Name'])
df = encoder.fit_transform(df)
# print dataframe.
print(df)
```
output:
```python
   Name  Age
0     1   20
1     2   21
2     3   19
3     4   18
4     1   20
5     2   21
6     3   19
7     4   18
```

In [76]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [77]:
df= sns.load_dataset('tips')
df.head()

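| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |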


In [78]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [80]:
df.nunique()

total_bill    229
tip           123
sex             2
smoker          2
day             4
time            2
size            6
dtype: int64

In [81]:
# let's label-encode the time column with scikit-learn
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

le = LabelEncoder()
df['encoded_time'] = le.fit_transform(df['time'])
df.head()

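| | total_bill | tip | sex | smoker | day | time | size | encoded_time |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 |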


In [82]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [83]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64
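
To check which integer stands for which label, the fitted encoder can be inspected. The sketch below uses a short hypothetical list instead of the tips data so that it runs on its own; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

times = ["Dinner", "Lunch", "Dinner", "Lunch", "Dinner"]

le_demo = LabelEncoder()
codes = le_demo.fit_transform(times)

# classes_ lists the labels in sorted order; the position of a label is its code.
print(le_demo.classes_.tolist())                   # ['Dinner', 'Lunch']
print(codes.tolist())                              # [0, 1, 0, 1, 0]

# inverse_transform maps codes back to the original labels.
print(le_demo.inverse_transform([0, 1]).tolist())  # ['Dinner', 'Lunch']
```

This matches the counts above: code 0 is Dinner (176 rows) and code 1 is Lunch (68 rows).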

In [84]:
# ordinal encoding of day, with the order Thur < Fri < Sat < Sun given explicitly
oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
df['encoded_day'] = oe.fit_transform(df[['day']])
df.head()

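| | total_bill | tip | sex | smoker | day | time | size | encoded_time | encoded_day |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0 | 3.0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0 | 3.0 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 | 0 | 3.0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0 | 3.0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0 | 3.0 |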


In [85]:
df['encoded_day'].value_counts()

encoded_day
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
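
The same explicit ordering can also be expressed with a pandas ordered categorical, which yields integer codes rather than floats. This is a sketch of an alternative, not a replacement for the scikit-learn encoder; the short `days` series is hypothetical toy data.

```python
import pandas as pd

days = pd.Series(["Sun", "Thur", "Sat", "Fri", "Sun"], name="day")

# Declare the category order explicitly: Thur < Fri < Sat < Sun.
day_order = ["Thur", "Fri", "Sat", "Sun"]
ordered_days = pd.Categorical(days, categories=day_order, ordered=True)

# Each code is the position of the value in day_order.
print(ordered_days.codes.tolist())  # [3, 0, 2, 1, 3]
```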

In [86]:
# one-hot encode the sex column
ohe = OneHotEncoder()
ohe.fit_transform(df[['sex']]).toarray()


array([[1., 0.],
       [0., 1.],
       [0., 1.],
       [0., 1.],
       [1., 0.],
       ...

(output truncated; the array has one row per record in df, 244 in total)

In [87]:
dft= sns.load_dataset('titanic')
dft.head()

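| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |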


In [88]:
# one-hot encode the embarked column
ohe = OneHotEncoder()
embarked_ohe = ohe.fit_transform(dft[['embarked']]).toarray()

# build a DataFrame from the encoded array; missing values get their own embarked_nan column
embarked_ohe_df = pd.DataFrame(embarked_ohe, columns=ohe.get_feature_names_out(['embarked']))
titanic_df = pd.concat([dft.reset_index(drop=True), embarked_ohe_df], axis=1)
titanic_df.head()


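| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone | embarked_C | embarked_Q | embarked_S | embarked_nan |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True | 0.0 | 0.0 | 1.0 | 0.0 |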
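
The embarked_nan column appears because the embarked column contains missing values. Two ways of handling them explicitly in pandas are sketched below on a hypothetical five-row column; the label "Unknown" is an arbitrary choice.

```python
import numpy as np
import pandas as pd

ports = pd.DataFrame({"embarked": ["S", "C", np.nan, "Q", "S"]})

# Option 1: turn missing values into an explicit category before encoding.
filled = ports["embarked"].fillna("Unknown")
encoded_filled = pd.get_dummies(filled, prefix="embarked", dtype=int)

# Option 2: ask get_dummies for an extra indicator column that flags NaN rows.
encoded_na = pd.get_dummies(ports["embarked"], prefix="embarked", dummy_na=True, dtype=int)

print(encoded_filled.columns.tolist())
# ['embarked_C', 'embarked_Q', 'embarked_S', 'embarked_Unknown']
print(encoded_na.shape)  # (5, 4)
```

Without either option, `get_dummies` leaves a row with a missing value as all zeros, which a model cannot distinguish from a dropped reference category.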


In [89]:
dft.head()

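| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |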


In [90]:
df=sns.load_dataset('tips')
df.head()

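| | total_bill | tip | sex | smoker | day | time | size |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |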


In [91]:
# binary encoding
from category_encoders import BinaryEncoder
be= BinaryEncoder()
df_b= be.fit_transform(df['day'])
df_b.head()

| | day_0 | day_1 | day_2 |
| --- | --- | --- | --- |
| 0 | 0 | 0 | 1 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 0 | 1 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |


# this guide for encoding categorical variables 