### Data encoding means representing categorical data in numerical form.

## Nominal/One Hot Encoding 

#### One-hot encoding (OHE) is a technique for representing categorical data as numerical data, which is more suitable for machine learning algorithms. In this technique, each category is represented as a binary vector in which each bit corresponds to one unique category. For example, a categorical variable "Color" with three categories can be represented as:

#### Red : [1,0,0]
#### Green : [0,1,0]
#### Blue : [0,0,1]

In [2]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
df = pd.DataFrame({
    'color' : ['red','blue','green','green','red','blue']
})

In [5]:
# create an instance of the one-hot encoder
encoder = OneHotEncoder()

In [8]:
# perform fit and transform, then convert the sparse result to a dense array
encoded_Values = encoder.fit_transform(df[['color']]).toarray()

In [9]:
# build a DataFrame from the encoded values, using the generated column names
encoder_df = pd.DataFrame(encoded_Values, columns=encoder.get_feature_names_out())

In [10]:
encoder_df

,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [11]:
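# encode new data; passing a DataFrame with the fitted column name avoids a feature-name warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()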



array([[1., 0., 0.]])

In [12]:
pd.concat([df,encoder_df],axis=1)

,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


## Label Encoding 
### It involves assigning a unique numerical label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories. For example, scikit-learn's LabelEncoder sorts the categories and numbers them from 0, so the category "color" becomes:
### Blue : 0
### Green : 1
### Red : 2

In [13]:
df.head()

,color
0,red
1,blue
2,green
3,green
4,red


In [14]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()


In [15]:
# pass a 1-D Series: LabelEncoder expects a 1-D array of labels, not a 2-D DataFrame
label_encoder.fit_transform(df['color'])


array([2, 0, 1, 1, 2, 0])

In [16]:
label_encoder.transform(['red'])

array([2])

In [19]:
label_encoder.transform(['blue'])

array([0])
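
To make the label assignment visible, the following sketch inspects the fitted encoder. It assumes the same color values as above; the names `colors`, `le`, `mapping` and `decoded` are introduced here for illustration only. The `classes_` attribute holds the categories in sorted order, and `inverse_transform` maps labels back to categories.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(['red', 'blue', 'green', 'green', 'red', 'blue'])

le = LabelEncoder()
codes = le.fit_transform(colors)

# classes_ lists the categories in sorted order; each position is that category's label
mapping = {str(category): index for index, category in enumerate(le.classes_)}
print(mapping)

# inverse_transform converts labels back to the original categories
decoded = le.inverse_transform([2, 0])
print(list(decoded))
```

Note that scikit-learn documents `LabelEncoder` as a tool for encoding target labels; for input features with a meaningful order, the ordinal encoder described next is usually the better fit.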

## Ordinal Encoding
### It is used to encode categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, "Education Level" can be encoded as:
#### High School : 1 
#### College : 2 
#### Graduate : 3
#### Post-graduate : 4

In [20]:
from sklearn.preprocessing import OrdinalEncoder
df = pd.DataFrame({
    'size' : ['small','medium','large','medium','small','large']
})

In [21]:
df

,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [23]:
encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [25]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [29]:
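# encode new data; passing a DataFrame with the fitted column name avoids a feature-name warning
encoder.transform(pd.DataFrame({'size': ['small']}))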



array([[0.]])
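
The "Education Level" example from the description can be sketched with plain pandas as well. This is a minimal illustration, assuming a hypothetical `education` DataFrame and an explicit `education_rank` dictionary that starts at 1 as in the list above (unlike `OrdinalEncoder`, which starts at 0).

```python
import pandas as pd

# Hypothetical sample data for illustration
education = pd.DataFrame({
    'education_level': ['College', 'High School', 'Post-graduate', 'Graduate', 'College']
})

# Explicit rank for each level, lowest to highest
education_rank = {'High School': 1, 'College': 2, 'Graduate': 3, 'Post-graduate': 4}

# Replace each level with its rank
education['education_encoded'] = education['education_level'].map(education_rank)
print(education)
```

An explicit dictionary keeps the order fully under our control; any level missing from the dictionary would be mapped to NaN, which makes unexpected values easy to spot.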

## Target Guided Ordinal Encoding 
#### It is a technique for encoding categorical variables based on their relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in our ML model.

### In this technique, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable for that category. Because the encoded values follow the per-category target statistic, the encoded feature has a monotonic relationship with the target on average, which can improve the predictive power of our model.

In [36]:
import pandas as pd
df = pd.DataFrame({
    'city' : ['New York','London','Paris','Tokyo','New York','Paris'],
    'price' : [200,150,300,250,180,320]
})

In [37]:
df

,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [38]:
# mean price per city, as a {city: mean_price} dictionary
mean_price = df.groupby('city')['price'].mean().to_dict()

In [39]:
df['city_encoded'] = df['city'].map(mean_price)

In [41]:
df[['price','city_encoded']]

,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0
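
The cells above replace each city with its mean price directly. A common variant, closer to the "ordinal" in the name, converts those means into integer ranks. The sketch below is one possible way to do that, reusing the same city and price values; the names `prices`, `city_means` and `city_rank` are assumptions introduced for this illustration.

```python
import pandas as pd

prices = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Mean target per category, sorted from lowest to highest
city_means = prices.groupby('city')['price'].mean().sort_values()

# Rank 1 goes to the city with the lowest mean price
city_rank = {city: rank for rank, city in enumerate(city_means.index, start=1)}

# Replace each city with its rank
prices['city_rank'] = prices['city'].map(city_rank)
print(prices)
```

In practice the per-category statistics are usually computed on the training data only and then applied to validation and test data, so that the target does not leak into the feature.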
