# Data Encoding

##### **Data encoding** is the process of converting categorical data (text or labels) into numerical form so that machine learning models can 
##### understand and process it. 

### One Hot Encoding
##### **One-Hot Encoding** is a technique that converts categorical variables into a series of binary (0 or 1) columns, where each category becomes its
##### own column. For each row, only the column corresponding to its category is set to 1, while others remain 0. This method is ideal for data with no 
##### inherent order, helping prevent the model from assuming any numerical relationship between categories.

In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [5]:
## Create a simple dataframe
df = pd.DataFrame({
    'color' : ['Red', 'Blue', 'Green','Green', 'Red', 'Blue']
})

In [6]:
df.head()

Unnamed: 0,color
0,Red
1,Blue
2,Green
3,Green
4,Red


In [7]:
# create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [8]:
# fit and transform; the result is a sparse matrix, so toarray() converts it to a dense array
encoder.fit_transform(df[['color']]).toarray()

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [9]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [10]:
# build a dataframe from the encoded array, using the encoder's generated column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [11]:
encoder_df 

Unnamed: 0,color_Blue,color_Green,color_Red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [12]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,color,color_Blue,color_Green,color_Red
0,Red,0.0,0.0,1.0
1,Blue,1.0,0.0,0.0
2,Green,0.0,1.0,0.0
3,Green,0.0,1.0,0.0
4,Red,0.0,0.0,1.0
5,Blue,1.0,0.0,0.0


### Label Encoding

##### **Label encoding** is a technique that assigns a unique integer to each category in a categorical variable. The integers are arbitrary (scikit-learn assigns them in sorted order), so for data without any intrinsic order an algorithm may wrongly read them as a ranking.
##### scikit-learn's `LabelEncoder` is intended for target labels; for unordered input features, one-hot encoding is usually the safer choice.

In [13]:
df.head()

Unnamed: 0,color
0,Red
1,Blue
2,Green
3,Green
4,Red


In [14]:
from sklearn.preprocessing import LabelEncoder

In [15]:
label_encoder = LabelEncoder()

In [16]:
label_encoder.fit_transform(df['color'])

array([2, 0, 1, 1, 2, 0])

In [20]:
# checking a single value (LabelEncoder expects a 1-D input)
label_encoder.transform(['Green'])

array([1])

In [21]:
label_encoder.transform(['Blue'])

array([0])
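
To see which integer belongs to which category, the fitted encoder can be inspected. The following is a small sketch that rebuilds the same encoder on the color values from above, reads the mapping from `classes_`, and decodes the integers back with `inverse_transform`. The names `colors`, `mapping`, and `decoded` are chosen here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Green', 'Red', 'Blue']

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; each position is the integer code
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}

# inverse_transform turns the integer codes back into the original labels
decoded = label_encoder.inverse_transform(codes).tolist()
print(decoded)
```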

### Ordinal Encoding

##### **Ordinal encoding** is a method of converting categorical data into numbers by assigning a unique integer to each category based on a specific 
##### order or ranking. This approach is useful for ordinal data, where the categories have a natural order (e.g., "Low," "Medium," "High").

Each category is assigned a numerical value based on its position in the order.
For example, a categorical variable 'education level' with four possible values
(high school, college, graduate, post-graduate) can be represented using ordinal
encoding as follows:


High school : 1

College : 2

Graduate : 3

Post-graduate : 4
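
The education-level mapping above can be sketched with a plain dictionary and pandas `map`. This is only an illustration: the `df_edu` dataframe and its rows are hypothetical sample data, and the 1-based codes follow the list above (scikit-learn's `OrdinalEncoder`, used below, starts at 0 instead).

```python
import pandas as pd

# Hypothetical sample data, for illustration only
df_edu = pd.DataFrame({
    'education_level': ['college', 'high school', 'post-graduate', 'graduate', 'college']
})

# Explicit rank for each category, lowest level first
education_order = {
    'high school': 1,
    'college': 2,
    'graduate': 3,
    'post-graduate': 4,
}

# Replace each category with its rank
df_edu['education_encoded'] = df_edu['education_level'].map(education_order)
print(df_edu)
```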


In [22]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder


In [26]:
df = pd.DataFrame({
    'size' : ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [27]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [28]:
# list the categories in rank order (small < medium < large) so the codes follow that order
Ord_encoder = OrdinalEncoder(categories=[['small','medium','large']])

In [29]:
Ord_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [30]:
Ord_encoder.transform([['small']])



array([[0.]])

In [31]:
Ord_encoder.transform([['large']])



array([[2.]])

### Target Guided Ordinal Encoding

##### **Target Guided Ordinal Encoding** is a technique used to encode categorical variables based on their relationship with the target variable.

##### In this technique, we replace each category in the categorical variable with a numerical value based on the mean or median of the target variable
##### for that category.

In [1]:
import pandas as pd

#creating a simple dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city' : ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price' : [200,150,300,250,180,320]
})


In [2]:
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [5]:
# mean target value (price) for each city, as a dictionary
mean_price = df.groupby('city')['price'].mean().to_dict()

In [6]:
mean_price

{'London': 150.0, 'New York': 190.0, 'Paris': 310.0, 'Tokyo': 250.0}

In [7]:
df['city_encoded'] = df['city'].map(mean_price)

In [9]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0


In [10]:
df[['city_encoded','price']] 

Unnamed: 0,city_encoded,price
0,190.0,200
1,150.0,150
2,310.0,300
3,250.0,250
4,190.0,180
5,310.0,320
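
The cells above replace each city with its mean price directly. A common variant, sketched below under that assumption, turns those means into ordinal ranks instead, so the cheapest city gets 0 and the most expensive gets the highest integer. The names `ordered_cities`, `city_rank`, and the `city_rank` column are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})

# Sort the cities by their mean price, lowest first
ordered_cities = df.groupby('city')['price'].mean().sort_values().index

# Assign each city its position in that ordering
city_rank = {city: rank for rank, city in enumerate(ordered_cities)}
print(city_rank)  # {'London': 0, 'New York': 1, 'Tokyo': 2, 'Paris': 3}

df['city_rank'] = df['city'].map(city_rank)
print(df)
```

In practice the mapping would typically be computed on the training data only and then applied to validation or test data, so that target values from held-out rows do not leak into the feature.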
