# **DATA ENCODING**
Data encoding is the conversion of categorical features into numerical features so that a model can be trained on them and produce meaningful predictions.


- Data encoding refers to the process of converting categorical or textual data into a numerical format that machine learning algorithms can work with effectively.
- Machine learning algorithms typically require numerical input data, so encoding categorical or textual features is essential for training models on such data.

**Types of data encoding**
- 1) Nominal / One Hot Encoding (OHE)
- 2) Label and ordinal encoding
- 3) Target guided ordinal encoding

## **1) Nominal / One Hot Encoding (OHE)**

- One-hot encoding is a technique used to convert categorical variables into binary vectors where each category is represented by a binary value (0 or 1).
- For each categorical feature, a binary vector is created with a length equal to the number of unique categories in that feature.
- The binary vector has a value of 1 in the position corresponding to the category of the instance and 0 in all other positions.
- One-hot encoding prevents the model from assuming any ordinal relationship between categories and is suitable for features where there is no inherent order.
- However, it can lead to high-dimensional sparse representations, especially if the categorical feature has many unique categories.
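
The mechanism described in the bullets above can be sketched by hand in plain Python. This is only an illustration of the idea, not how scikit-learn implements it; the names `categories`, `position` and `one_hot` are made up for this example.

```python
# Hand-rolled one-hot encoding, for illustration only.
colours = ["red", "blue", "green", "green", "blue", "red"]

# The sorted unique categories fix the position of each category in the vector.
categories = sorted(set(colours))  # ['blue', 'green', 'red']
position = {cat: i for i, cat in enumerate(categories)}

def one_hot(value):
    """Return a list with 1 at the category's position and 0 elsewhere."""
    vector = [0] * len(categories)
    vector[position[value]] = 1
    return vector

encoded_rows = [one_hot(c) for c in colours]
for colour, row in zip(colours, encoded_rows):
    print(colour, row)
```

Each vector has length 3 (the number of unique colours) and contains exactly one 1, so no colour is treated as larger or smaller than another.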

In [None]:
#import the libraries
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

In [None]:
#create a sample dataframe with categorical variable
df = pd.DataFrame({
    "colour" : ["red", "blue", "green", "green", "blue", "red"]
})

In [None]:
df

Unnamed: 0,colour
0,red
1,blue
2,green
3,green
4,blue
5,red


In [None]:
#create an instance of the one hot encoder
encoder = OneHotEncoder()

In [None]:
#fit the encoder and transform the data in one step
encoded = encoder.fit_transform(df[["colour"]])

In [None]:
encoded

<6x3 sparse matrix of type '<class 'numpy.float64'>'
	with 6 stored elements in Compressed Sparse Row format>

In [None]:
encoded_df = pd.DataFrame(encoded.toarray(), columns = encoder.get_feature_names_out())

In [None]:
encoded_df

Unnamed: 0,colour_blue,colour_green,colour_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,1.0,0.0,0.0
5,0.0,0.0,1.0


In [None]:
encoder.get_feature_names_out()

array(['colour_blue', 'colour_green', 'colour_red'], dtype=object)

In [None]:
#concatenate df with the new binary columns
pd.concat([df, encoded_df]) #default axis=0 stacks the frames row-wise, so unmatched columns become NaN

Unnamed: 0,colour,colour_blue,colour_green,colour_red
0,red,,,
1,blue,,,
2,green,,,
3,green,,,
4,blue,,,
5,red,,,
0,,0.0,0.0,1.0
1,,1.0,0.0,0.0
2,,0.0,1.0,0.0
3,,0.0,1.0,0.0
4,,1.0,0.0,0.0
5,,0.0,0.0,1.0


In [None]:
pd.concat([df, encoded_df], axis = 1) #axis = 1 joins the frames column-wise (side by side), aligned on the index

Unnamed: 0,colour,colour_blue,colour_green,colour_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,blue,1.0,0.0,0.0
5,red,0.0,0.0,1.0
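
The three one-hot columns above carry one redundant column: if a row is neither green nor red, it must be blue. As a hedged alternative sketch, pandas' `get_dummies` with `drop_first=True` produces the reduced encoding directly. The frame name `df_demo` is introduced here only for illustration.

```python
import pandas as pd

# Same sample data as above, in a separate frame so nothing is overwritten.
df_demo = pd.DataFrame({"colour": ["red", "blue", "green", "green", "blue", "red"]})

# drop_first=True omits the first category ("blue" in sorted order);
# a row of all zeros then stands for "blue".
dummies = pd.get_dummies(df_demo["colour"], prefix="colour", drop_first=True, dtype=int)
print(dummies)
```

Whether dropping a column is desirable depends on the model; it mainly matters for linear models, where the full set of columns is perfectly collinear with an intercept.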


## **2) Label and ordinal encoding**
**Label Encoding:**

- Label encoding is a technique used to convert categorical variables into numerical labels, where each category is assigned a unique integer.
- Each category in the categorical feature is mapped to an integer from 0 up to the number of unique categories minus one; scikit-learn's `LabelEncoder` assigns these integers in sorted (alphabetical) order of the category names.
- Label encoding can be useful for ordinal categorical variables, but only when the assigned integers happen to match the natural order of the categories.
- However, it may introduce unintended ordinality in non-ordinal categorical features, leading to misinterpretation by the model.
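
To make the risk of unintended ordering concrete, here is a minimal plain-Python sketch that mimics label encoding, assuming labels are assigned in sorted order of the category names (as scikit-learn's `LabelEncoder` does). The `sizes` data and the names `classes` and `label_of` are hypothetical.

```python
# Hypothetical feature with a natural order: small < medium < large.
sizes = ["small", "large", "medium", "small", "large"]

# Labels are assigned by sorted (alphabetical) order, not by meaning.
classes = sorted(set(sizes))  # ['large', 'medium', 'small']
label_of = {c: i for i, c in enumerate(classes)}

labels = [label_of[s] for s in sizes]
print(label_of)  # {'large': 0, 'medium': 1, 'small': 2}
print(labels)    # [2, 0, 1, 2, 0]
```

Here "large" receives the smallest label and "small" the largest, which is the reverse of the natural order, so a model that reads the integers as magnitudes would be misled.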

In [None]:
#import the library required for label encoding
from sklearn.preprocessing import LabelEncoder

In [None]:
#create a categorical dataframe for label encoding
df = pd.DataFrame({
    "color" : ["red", "yellow","blue", "green", "green", "blue", "red", "blue", "yellow", "green", "yellow"]
})

In [None]:
df

Unnamed: 0,color
0,red
1,yellow
2,blue
3,green
4,green
5,blue
6,red
7,blue
8,yellow
9,green
10,yellow


In [None]:
#create the instance
encoder = LabelEncoder()

In [None]:
encoder.fit_transform(df["color"])

array([2, 3, 0, 1, 1, 0, 2, 0, 3, 1, 3])

**Ordinal Encoding:**

- Ordinal encoding is similar to label encoding but explicitly specifies the mapping between categories and numerical labels based on their ordinal relationship.
- This technique is suitable for ordinal categorical variables where the categories have a meaningful order.
- Ordinal encoding preserves the ordinality of categories, allowing the model to capture the inherent hierarchy among them.
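
As a small sketch of the idea, an explicit order can also be applied with a plain dictionary and pandas' `map`. The feature `size`, the frame `df_sizes` and the list `size_order` are assumptions made for this illustration; they do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical ordinal feature: small < medium < large.
df_sizes = pd.DataFrame({"size": ["small", "large", "medium", "small", "large"]})

# State the order explicitly, lowest category first.
size_order = ["small", "medium", "large"]
size_rank = {cat: rank for rank, cat in enumerate(size_order)}

# Replace each category with its rank in the stated order.
df_sizes["size_encoded"] = df_sizes["size"].map(size_rank)
print(df_sizes)
```

Because the order is written out by hand, the resulting integers (small=0, medium=1, large=2) follow the meaning of the categories rather than their spelling.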

In [None]:
#import library required for ordinal encoder
from sklearn.preprocessing import OrdinalEncoder

In [None]:
#create a categorical dataframe for ordinal encoding
df = pd.DataFrame({
    "color" : ["red", "yellow","blue", "green", "green", "blue", "red", "blue", "yellow", "green", "yellow"]
})

In [None]:
#create the instance, passing the category order explicitly (yellow=0, red=1, green=2, blue=3; this order is only illustrative)
encoder = OrdinalEncoder(categories = [["yellow", "red", "green", "blue"]])

In [None]:
#fit and transform the dataset
encoder.fit_transform(df[["color"]])

array([[1.],
       [0.],
       [3.],
       [2.],
       [2.],
       [3.],
       [1.],
       [3.],
       [0.],
       [2.],
       [0.]])

## **3) Target guided ordinal encoding**

- Target-guided ordinal encoding is a data encoding technique for categorical variables in supervised machine learning tasks, where a target variable is available to guide the encoding. It is most straightforward when the target is numeric (or can be coded numerically), since it relies on a per-category statistic of the target.
- This technique encodes categorical variables based on the relationship between the categories and the target variable, thus capturing the ordinality or predictive power of the categories with respect to the target.
- Here, each category in the categorical variable is replaced with a numerical value based on the mean or median of the target variable for that category.

In [None]:
#create a sample dataframe with a categorical variable and target variable
df = pd.DataFrame({
    "city" : ["Mumbai", "Delhi", "Nagpur", "Pune", "Mumbai", "Nagpur"],
    "price" : [200, 150, 300, 250, 180, 320]
})

In [None]:
df

Unnamed: 0,city,price
0,Mumbai,200
1,Delhi,150
2,Nagpur,300
3,Pune,250
4,Mumbai,180
5,Nagpur,320


In [None]:
#calculate the mean price for each city
df.groupby("city")["price"].mean()

city
Delhi     150.0
Mumbai    190.0
Nagpur    310.0
Pune      250.0
Name: price, dtype: float64

In [None]:
#convert into dict form
mean_price = df.groupby("city")["price"].mean().to_dict()

In [None]:
mean_price

{'Delhi': 150.0, 'Mumbai': 190.0, 'Nagpur': 310.0, 'Pune': 250.0}

In [None]:
#add the column which shows the mean price of each city
df["city_encoded"] = df["city"].map(mean_price)

In [None]:
df

Unnamed: 0,city,price,city_encoded
0,Mumbai,200,190.0
1,Delhi,150,150.0
2,Nagpur,300,310.0
3,Pune,250,250.0
4,Mumbai,180,190.0
5,Nagpur,320,310.0
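
The example above stores the raw mean price per city. A variant that yields integer ranks, which is closer to what "ordinal" in the name suggests, can be sketched as follows. The names `df_cities`, `mean_by_city` and `city_rank` are introduced here for illustration only.

```python
import pandas as pd

# Same sample data as above, in a separate frame.
df_cities = pd.DataFrame({
    "city": ["Mumbai", "Delhi", "Nagpur", "Pune", "Mumbai", "Nagpur"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean price per city, sorted from lowest to highest.
mean_by_city = df_cities.groupby("city")["price"].mean().sort_values()

# Rank 0 goes to the city with the lowest mean price.
city_rank = {city: rank for rank, city in enumerate(mean_by_city.index)}

df_cities["city_rank"] = df_cities["city"].map(city_rank)
print(df_cities)
```

In practice the per-category means would normally be computed on the training data only, so that the encoding does not leak target information from validation or test rows.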
