# Data Encoding

### 1. Nominal/OHE Encoding
### 2. Label and Ordinal Encoding
### 3. Target Guided Ordinal Encoding

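### Nominal/OHE Encoding
One-hot encoding (OHE) represents categorical data as numerical data in the form of binary vectors, which machine learning algorithms can work with more easily. It suits nominal variables, whose categories have no inherent order: each category gets its own column, holding 1 where the row belongs to that category and 0 elsewhere.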

In [24]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [25]:
## Create a simple dataframe
df = pd.DataFrame({
    'color' : ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [26]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [27]:
## Create an instance of OneHotEncoder

encoder = OneHotEncoder()

## perform fit and transform

encoded = encoder.fit_transform(df[['color']]).toarray()

In [28]:
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [29]:
encoded_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [30]:
## for new data
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])
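
Transforming new data raises a question the cell above does not cover: what happens when a category was never seen during `fit`? The following is a minimal sketch, assuming we want unseen categories encoded as an all-zero row rather than raising an error. The names `train`, `safe_encoder`, and `new_data` are illustrative and not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same colors as the example dataframe above (name `train` is illustrative)
train = pd.DataFrame({'color': ['red', 'blue', 'green', 'green', 'red', 'blue']})

# handle_unknown='ignore' encodes unseen categories as all zeros;
# the default, handle_unknown='error', raises a ValueError instead.
safe_encoder = OneHotEncoder(handle_unknown='ignore')
safe_encoder.fit(train[['color']])

# 'blue' was seen during fit, 'purple' was not
new_data = pd.DataFrame({'color': ['blue', 'purple']})
encoded_new = safe_encoder.transform(new_data).toarray()
print(encoded_new)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

Whether an all-zero row is acceptable depends on the model; it is one option, not necessarily the right default for every pipeline.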

In [31]:
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


### Label Encoding

Label encoding and ordinal encoding are two techniques for encoding categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. Depending on the tool, the integers are assigned in alphabetical order or by category frequency; scikit-learn's `LabelEncoder` sorts the categories, so here `blue`, `green`, and `red` map to 0, 1, and 2.

In [32]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [33]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [35]:
# LabelEncoder expects 1-D input, so pass the Series rather than a one-column DataFrame
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 1, 2, 0])

In [38]:
# Pass 1-D lists when transforming new values
lbl_encoder.transform(['red']), lbl_encoder.transform(['blue']), lbl_encoder.transform(['green'])


(array([2]), array([0]), array([1]))
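
To see where the integer codes come from, it can help to inspect the fitted encoder. The sketch below assumes the same six colors as above; `colors`, `le`, `mapping`, and `decoded` are illustrative names. It reads the sorted categories from `classes_` and reverses the encoding with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Same colors as the example dataframe (plain list, so the input is 1-D)
colors = ['red', 'blue', 'green', 'green', 'red', 'blue']

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ holds the categories in sorted order;
# a category's code is its index in this array.
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform maps the integer codes back to the original labels
decoded = le.inverse_transform(codes)
print(list(decoded) == colors)  # True
```

Because the codes follow sorted order only, they carry no meaningful ranking for a nominal variable such as color.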

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order.

In [39]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [40]:
# create a simple dataframe with Ordinal Variable

df = pd.DataFrame({
    'size' : ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [42]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [44]:
## create an instance of Ordinal Encoder and then fit_transform
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [45]:
encoder.transform([['small']])



array([[0.]])
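
The `categories` argument is what makes the codes follow the intended ranking. As a rough illustration, assuming the same `size` values as above (the names `sizes`, `default_encoder`, and `ordered_encoder` are illustrative), the sketch below compares the encoder with and without an explicit order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium', 'small', 'large']})

# Without `categories`, the order is inferred by sorting the unique values,
# which for strings means alphabetical order: large, medium, small.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(sizes[['size']])
print(default_codes.ravel())  # [2. 1. 0. 1. 2. 0.]

# With an explicit order, the codes follow small < medium < large.
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(sizes[['size']])
print(ordered_codes.ravel())  # [0. 1. 2. 1. 0. 2.]
```

In the default case `small` receives the largest code, which reverses the real ranking, so passing the order explicitly is usually the safer choice for ordinal variables.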

### Target Guided Ordinal Encoding

Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, we replace each category in the categorical variable with a numeric value representing the mean or median of the target variable for that category. This creates a monotonic relationship between the encoded variable and the target variable, which can improve the predictive power of the model.

In [1]:
import pandas as pd

## Create a simple dataframe with a categorical variable and a target variable

df = pd.DataFrame({
    'city' : ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price' : [200, 150, 300, 250, 180, 320]
})



In [2]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [5]:
mean_price = df.groupby('city')['price'].mean().to_dict()

In [6]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [7]:
df['city_encoded'] = df['city'].map(mean_price)

In [9]:
df[['city', 'city_encoded']]

Unnamed: 0,city,city_encoded
0,New York,190.0
1,London,150.0
2,Paris,310.0
3,Tokyo,250.0
4,New York,190.0
5,Paris,310.0
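
The cells above map each city directly to its mean price. A common variant, sketched here as an assumption rather than as the notebook's own method, sorts the categories by that mean and assigns integer ranks, so the encoded column is ordinal in the strict sense. The names `cities`, `rank_map`, and `city_rank` are illustrative.

```python
import pandas as pd

cities = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320],
})

# Mean target per category, sorted from lowest to highest
mean_price = cities.groupby('city')['price'].mean().sort_values()

# Assign rank 0 to the category with the lowest mean, 1 to the next, and so on
rank_map = {city: rank for rank, city in enumerate(mean_price.index)}
print(rank_map)  # {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}

cities['city_rank'] = cities['city'].map(rank_map)
print(cities['city_rank'].tolist())  # [1, 0, 3, 2, 1, 3]
```

Ranks keep the ordering of the categories but discard the size of the gaps between their means; which form works better depends on the model.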


In [11]:
import seaborn as sns
df = sns.load_dataset('tips')

In [12]:
df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [13]:
# 'time' is a categorical column; passing `observed` explicitly avoids
# the pandas FutureWarning about the changing default
mean_total = df.groupby('time', observed=False)['total_bill'].mean().to_dict()


In [14]:
mean_total

{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [15]:
df['time_encoded'] = df['time'].map(mean_total)

In [16]:
df[['time', 'time_encoded']]

Unnamed: 0,time,time_encoded
0,Dinner,20.797159
1,Dinner,20.797159
2,Dinner,20.797159
3,Dinner,20.797159
4,Dinner,20.797159
...,...,...
239,Dinner,20.797159
240,Dinner,20.797159
241,Dinner,20.797159
242,Dinner,20.797159
