In [1]:
print('hello ml') # https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/

hello ml


# Encoding Categorical data

### Categorical data can be classified into several types based on the nature of the categories and the relationships between them. Here are some common types of categorical data:

## Types of data

### 1.Nominal Data: Nominal data consists of categories that have no inherent order or ranking. Examples include colors (red, blue, green), gender (male, female), and countries (USA, Canada, France). Nominal data can be encoded using methods such as one-hot encoding or label encoding.

### 2.Ordinal Data: Ordinal data represents categories that have a natural order or hierarchy. The categories can be ranked or ordered based on some criteria. Examples of ordinal data include education levels (high school, college, graduate), survey responses (strongly agree, agree, neutral, disagree, strongly disagree), or customer satisfaction ratings (very satisfied, satisfied, neutral, dissatisfied, very dissatisfied). Ordinal data can be encoded using methods like ordinal encoding.

### 3. Binary Data: Binary data has only two categories or levels, representing two mutually exclusive options or outcomes. Examples include yes/no, true/false, success/failure, or present/absent. Binary data can be encoded with label encoding (assigning 0 or 1) or with one-hot encoding, which needs only a single binary feature once the redundant second column is dropped.

### 4.Count Data: Count data represents the frequency or occurrence of categories within a given context. It involves counting the number of occurrences or events falling into different categories. Examples include the number of customer complaints by type (product-related, service-related, billing-related) or the number of defects in different product categories (electronic, mechanical, cosmetic). Count data may not require explicit encoding, but it may need preprocessing or transformation before being used in a machine learning algorithm.
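
As a small illustration of the nominal/ordinal distinction described above, the sketch below uses pandas' `Categorical` type. The sample values and variable names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Nominal: categories with no order (hypothetical sample values)
colors = pd.Categorical(['red', 'blue', 'green', 'blue'])

# Ordinal: the order is stated explicitly, lowest to highest
education = pd.Categorical(
    ['college', 'high school', 'graduate'],
    categories=['high school', 'college', 'graduate'],
    ordered=True,
)

print(colors.ordered)      # whether pandas treats the categories as ordered
print(education.ordered)
print(education.codes)     # position of each value in `categories`
print(education.min(), '<', education.max())
```

Only the ordered categorical supports comparisons such as `min()` and `max()`, which is the property ordinal encoding tries to preserve.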

## Methods for encoding categorical data

### 1.Ordinal Encoding: This method assigns a unique integer value to each category based on their order or rank. For example, if you have an education level variable with categories "high school," "college," and "graduate," you can assign the values 1, 2, and 3, respectively. However, this method assumes an inherent order or hierarchy among the categories, which may not always be appropriate.

In [4]:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

In [7]:
# define data
data = np.array([['red'], ['green'], ['blue']])
data

[['red']
 ['green']
 ['blue']]


In [8]:
# define ordinal encoding
encoder = OrdinalEncoder()

In [9]:
# transform data
result = encoder.fit_transform(data)
result

array([[2.],
       [1.],
       [0.]])

In [11]:
# 2nd example
import pandas as pd

In [12]:
df=pd.DataFrame({'height':['tall','medium','short','tall','medium','short','tall','medium','short',]})
df 

Unnamed: 0,height
0,tall
1,medium
2,short
3,tall
4,medium
5,short
6,tall
7,medium
8,short


In [13]:
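# OrdinalEncoder expects 2-D input; ravel() flattens the (n, 1) result into one column
df['transformed'] = encoder.fit_transform(df[['height']]).ravel()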
df

Unnamed: 0,height,transformed
0,tall,2.0
1,medium,0.0
2,short,1.0
3,tall,2.0
4,medium,0.0
5,short,1.0
6,tall,2.0
7,medium,0.0
8,short,1.0
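
By default `OrdinalEncoder` sorts the categories alphabetically, which is why the output above gives medium=0, short=1, tall=2 instead of the natural order. A minimal sketch of passing the intended order through the `categories` parameter follows. The names `df_height` and `ordered_encoder` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_height = pd.DataFrame({'height': ['tall', 'medium', 'short', 'tall', 'medium']})

# State the intended rank explicitly, lowest to highest
ordered_encoder = OrdinalEncoder(categories=[['short', 'medium', 'tall']])

# fit_transform returns an (n, 1) array; ravel() flattens it to one column
df_height['transformed'] = ordered_encoder.fit_transform(df_height[['height']]).ravel()
print(df_height)
```

With the explicit list, short maps to 0.0, medium to 1.0 and tall to 2.0, so the integers follow the actual ranking.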


### 2. Label Encoding: Label encoding assigns a unique integer to each category, much like ordinal encoding, but the integers are not meant to express any order or hierarchy. scikit-learn's LabelEncoder numbers the categories from 0 in sorted order and is intended for encoding target labels rather than input features. This method can be useful when the categories have no natural order, although some models may still interpret the integers as a ranking.

In [14]:
from sklearn.preprocessing import LabelEncoder 

In [15]:
# define label encoding
le = LabelEncoder()

In [27]:
df = {"Gender" : ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],  
      "Name" : ['Cindy', 'Carl', 'Johnny', 'Stacey', 'Andy', 'Sara', 'Victor', 'Martha', 'Mindy', 'Max']}  
df

{'Gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
 'Name': ['Cindy',
  'Carl',
  'Johnny',
  'Stacey',
  'Andy',
  'Sara',
  'Victor',
  'Martha',
  'Mindy',
  'Max']}

In [30]:
df['LEData-gender'] = le.fit_transform(df['Gender'])
df

{'Gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
 'Name': ['Cindy',
  'Carl',
  'Johnny',
  'Stacey',
  'Andy',
  'Sara',
  'Victor',
  'Martha',
  'Mindy',
  'Max'],
 'LEData': array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1], dtype=int64),
 'LEData-gender': array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1], dtype=int64)}

In [32]:
df['LEData-Name'] = le.fit_transform(df['Name'])
df

{'Gender': ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
 'Name': ['Cindy',
  'Carl',
  'Johnny',
  'Stacey',
  'Andy',
  'Sara',
  'Victor',
  'Martha',
  'Mindy',
  'Max'],
 'LEData': array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1], dtype=int64),
 'LEData-gender': array([0, 1, 1, 0, 1, 0, 1, 0, 0, 1], dtype=int64),
 'LEData-Name': array([2, 1, 3, 8, 0, 7, 9, 4, 6, 5], dtype=int64)}

In [33]:
# 2nd example
df=pd.DataFrame({'height':['tall','medium','short','tall','medium','short','tall','medium','short',]})
df 

Unnamed: 0,height
0,tall
1,medium
2,short
3,tall
4,medium
5,short
6,tall
7,medium
8,short


In [42]:
# fit() only learns the classes and returns the encoder itself, so do not assign it to a column
le.fit(df['height'])
transformed_data = le.transform(df['height'])
transformed_data

array([2, 0, 1, 2, 0, 1, 2, 0, 1])

In [44]:
# use fit_transform to fit the encoder and transform the data in one step
df['LEData'] = le.fit_transform(df['height'])
df

Unnamed: 0,height,LEData
0,tall,2
1,medium,0
2,short,1
3,tall,2
4,medium,0
5,short,1
6,tall,2
7,medium,0
8,short,1
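
To see which integer belongs to which category, and to map the integers back, the fitted encoder exposes `classes_` and `inverse_transform`. The sketch below uses a shortened, hypothetical list of heights and a separate encoder named `label_enc`.

```python
from sklearn.preprocessing import LabelEncoder

heights = ['tall', 'medium', 'short', 'tall']

label_enc = LabelEncoder()
codes = label_enc.fit_transform(heights)

# classes_ holds the sorted unique labels; the index is the assigned integer
print(label_enc.classes_)
print(codes)

# inverse_transform recovers the original labels from the integers
print(label_enc.inverse_transform(codes))
```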


### 3. One-Hot Encoding: One-hot encoding creates a binary feature for each category, where a value of 1 represents the presence of the category, and 0 represents its absence. For example, if you have a color variable with categories "red," "blue," and "green," one-hot encoding would create three separate binary features: red (1 or 0), blue (1 or 0), and green (1 or 0). This method avoids introducing an order or hierarchy among the categories. 

In [45]:
from sklearn.preprocessing import OneHotEncoder

In [62]:
# define one hot encoding
ohe = OneHotEncoder()
ohe

OneHotEncoder()

In [74]:
df = pd.DataFrame({"Gender" : ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],  
      "Name" : ['Cindy', 'Carl', 'Johnny', 'Stacey', 'Andy', 'Sara', 'Victor', 'Martha', 'Mindy', 'Max']})
df

Unnamed: 0,Gender,Name
0,F,Cindy
1,M,Carl
2,M,Johnny
3,F,Stacey
4,M,Andy
5,F,Sara
6,M,Victor
7,F,Martha
8,F,Mindy
9,M,Max


In [75]:
df1 = ohe.fit_transform(df).toarray()
df1

array([[1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
       [1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.]])

In [89]:
df1 = ohe.fit_transform(df[['Gender']]).toarray()
df2 = ohe.fit_transform(df[['Name']]).toarray()
print(df1)
print(df2)

[[1. 0.]
 [0. 1.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [0. 1.]
 [1. 0.]
 [1. 0.]
 [0. 1.]]
[[0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [1. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]]


In [90]:
# 2nd example
df = pd.DataFrame({'animal':['cat','dog','horse']})
df

Unnamed: 0,animal
0,cat
1,dog
2,horse


In [91]:
df1 = ohe.fit_transform(df).toarray()
df1

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
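
For comparison, pandas offers `get_dummies`, which returns a labelled DataFrame instead of a bare array. This is an alternative to `OneHotEncoder` and not what the cells above use. The data and the names `animals`, `one_hot` and `reduced` are hypothetical.

```python
import pandas as pd

animals = pd.DataFrame({'animal': ['cat', 'dog', 'horse', 'dog']})

# One 0/1 column per category, named <column>_<category>
one_hot = pd.get_dummies(animals, columns=['animal'], dtype=int)
print(one_hot)

# drop_first=True removes the first category's column, since it can be
# inferred from the others (all zeros means 'cat' here)
reduced = pd.get_dummies(animals, columns=['animal'], dtype=int, drop_first=True)
print(reduced)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories it saw, so it is best suited to quick exploration.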

### 4. Binary Encoding: Binary encoding first maps each category to an integer and then writes that integer in binary, with one column per binary digit. Each category therefore gets a unique binary code, but an individual digit does not correspond to a single category. Because n categories need only about log2(n) columns, this method reduces the dimensionality of the encoded data compared to one-hot encoding.

In [95]:
import category_encoders as ce

In [97]:
# Create the original dataframe
df = pd.DataFrame({
    "Gender": ['F', 'M', 'M', 'F', 'M', 'F', 'M', 'F', 'F', 'M'],
    "Name": ['Cindy', 'Carl', 'Johnny', 'Stacey', 'Andy', 'Sara', 'Victor', 'Martha', 'Mindy', 'Max']
})
df

Unnamed: 0,Gender,Name
0,F,Cindy
1,M,Carl
2,M,Johnny
3,F,Stacey
4,M,Andy
5,F,Sara
6,M,Victor
7,F,Martha
8,F,Mindy
9,M,Max


In [98]:
# Perform binary encoding
encoder = ce.BinaryEncoder(cols=['Gender'])
df_encoded = encoder.fit_transform(df)
df_encoded

Unnamed: 0,Gender_0,Gender_1,Name
0,0,1,Cindy
1,1,0,Carl
2,1,0,Johnny
3,0,1,Stacey
4,1,0,Andy
5,0,1,Sara
6,1,0,Victor
7,0,1,Martha
8,0,1,Mindy
9,1,0,Max
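
The two steps behind binary encoding can be sketched by hand with pandas alone. This is an illustration of the idea, not the `category_encoders` implementation, and it assumes integers are assigned from 1 in order of first appearance. The column names `Name_0`, `Name_1`, ... are chosen to resemble the output above.

```python
import pandas as pd

names = pd.Series(['Cindy', 'Carl', 'Johnny', 'Stacey', 'Andy'], name='Name')

# Step 1: map each category to an integer, starting at 1,
# in order of first appearance (an assumption of this sketch)
codes = pd.factorize(names)[0] + 1

# Step 2: write each integer in binary, one column per bit,
# most significant bit first
n_bits = int(codes.max()).bit_length()
bits = {
    f'Name_{i}': (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
encoded = pd.DataFrame(bits)
print(encoded)
```

Five categories fit into three columns here, whereas one-hot encoding would need five.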


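### 5. Target Encoding: Target encoding, also known as mean encoding, replaces each category with the mean of the target variable (the variable you want to predict) for that category. This method can be useful when the target variable exhibits a clear relationship with the categorical variable. Note that category_encoders' TargetEncoder smooths each category mean toward the overall target mean, so the encoded values in the example are not the raw per-class means.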

In [110]:
# Create the DataFrame
# Note: 'A,' (comma inside the quotes) is treated as its own category, separate from 'A'
data=pd.DataFrame({'class':['A,','B','C','B','C','A','A','A'],'Marks':[50,30,70,80,45,97,80,68]})
data

Unnamed: 0,class,Marks
0,"A,",50
1,B,30
2,C,70
3,B,80
4,C,45
5,A,97
6,A,80
7,A,68


In [111]:
#Create target encoding object
encoder=ce.TargetEncoder(cols='class') 

In [112]:
#Fit and Transform Train Data
data = encoder.fit_transform(data['class'],data['Marks'])
data

Unnamed: 0,class
0,63.048373
1,63.581489
2,63.936117
3,63.581489
4,63.936117
5,67.574421
6,67.574421
7,67.574421
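
To make the idea concrete, the sketch below computes a plain target encoding by hand with `groupby`, and then a simple additively smoothed variant. It uses the same marks with the first class written as 'A' instead of 'A,' (an assumption). The smoothing weight `m` and the formula are illustrative only and are not the exact formula `category_encoders` applies, so the numbers will differ from the output above.

```python
import pandas as pd

# Same marks as above, with the first class written as 'A' (assumption)
marks = pd.DataFrame({
    'class': ['A', 'B', 'C', 'B', 'C', 'A', 'A', 'A'],
    'Marks': [50, 30, 70, 80, 45, 97, 80, 68],
})

global_mean = marks['Marks'].mean()
stats = marks.groupby('class')['Marks'].agg(['mean', 'count'])

# Plain target encoding: replace each class by its mean mark
marks['class_mean'] = marks['class'].map(stats['mean'])

# Additive smoothing with a hypothetical weight m: classes with few rows
# are pulled more strongly toward the global mean
m = 2
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
marks['class_smoothed'] = marks['class'].map(smoothed)
print(marks)
```

Because the encoding is computed from the target, it should be fitted on training data only and then applied to validation or test data, otherwise target information leaks into the features.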
