## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

https://www.geeksforgeeks.org/machine-learning/categorical-data-encoding-techniques-in-machine-learning/

### Encoding:
Encoding transforms raw input data into a numerical representation that a machine learning model can understand and process. This process aims to capture the essential information and features of the input while potentially reducing dimensionality or converting categorical data into a numerical format.
1. Feature Encoding: For categorical data, techniques like one-hot encoding, label encoding, or ordinal encoding convert categories into numerical values. For example, in natural language processing (NLP), words or characters are often encoded into numerical vectors (e.g., word embeddings).
2. Encoder in Encoder-Decoder Architectures: In these architectures, the encoder is a neural network (e.g., RNN, LSTM, GRU, Transformer) that takes an input sequence (e.g., a sentence in one language) and processes it into a numerical representation. In classic recurrent models this is a single fixed-size vector, often called a "context vector" or "hidden state", that summarizes the entire input sequence; attention-based models such as the Transformer instead keep one vector per input position.

### Decoding:
Decoding is the reverse of encoding. It takes the numerical output of a model and transforms it back into a human-readable or interpretable format.
1. Feature Decoding: If categorical data was encoded, decoding maps numerical predictions back to their original categories.
2. Decoder in Encoder-Decoder Architectures: The decoder is another neural network that takes the encoded representation (the context vector from the encoder) and generates an output sequence (e.g., a translated sentence in another language). The decoder typically generates the output sequence step by step, using its current state and previously generated outputs to predict the next element in the sequence.

Examples of Encoding and Decoding in ML:
1. Machine Translation: An encoder processes an English sentence into a numerical representation, and a decoder then generates the French translation from it.
2. Text Summarization: An encoder compresses a long text into a concise representation, and a decoder generates a summary from this representation.
3. Image Captioning: An encoder extracts features from an image, and a decoder generates a textual description (caption) of the image.
4. Autoencoders: These are neural networks designed to learn efficient data encodings. An encoder maps input data to a lower-dimensional latent space, and a decoder reconstructs the original input from this latent representation.
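
Feature encoding and decoding, as described above, amount to a mapping and its inverse. The following is a minimal sketch in plain Python; the category names and the "predicted" codes are hypothetical values chosen for illustration, not output from a real model.

```python
# Hypothetical category vocabulary; the names are illustrative only.
categories = ["cat", "dog", "bird"]

# Encoding: category -> integer code.
encode_map = {name: code for code, name in enumerate(categories)}
# Decoding: integer code -> category (the inverse mapping).
decode_map = {code: name for name, code in encode_map.items()}

raw_labels = ["dog", "bird", "cat", "dog"]
encoded = [encode_map[label] for label in raw_labels]

# Pretend these integers came back from a model as predictions.
predicted_codes = [1, 1, 0, 2]
decoded = [decode_map[code] for code in predicted_codes]

print(encoded)  # [1, 2, 0, 1]
print(decoded)  # ['dog', 'dog', 'cat', 'bird']
```

Decoding only works if the mapping built at encoding time is kept around, which is why fitted encoder objects are usually saved alongside the model.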

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data that machine learning algorithms can consume. It suits nominal variables, whose categories have no inherent order. Each category is represented as a binary vector with one position per unique category: the position of the matching category is set to 1 and all others to 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
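
To make the mapping in the list above concrete, here is a minimal from-scratch sketch in plain Python. The helper name `one_hot` is hypothetical and not part of any library; it only illustrates the idea before the scikit-learn version.

```python
def one_hot(value, categories):
    """Return a binary vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]


# Column order follows the list above: red, green, blue.
categories = ["red", "green", "blue"]

vectors = {color: one_hot(color, categories) for color in categories}
print(vectors["red"])    # [1, 0, 0]
print(vectors["green"])  # [0, 1, 0]
print(vectors["blue"])   # [0, 0, 1]
```

The column order here is chosen by hand. scikit-learn's `OneHotEncoder` sorts the categories instead, so its columns come out as blue, green, red.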

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [3]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [4]:
## create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [5]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [6]:
## wrap the encoded array in a DataFrame with readable column names
encoder_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [7]:
encoder_df

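|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |
| 5 | 1.0        | 0.0         | 0.0       |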


In [8]:
## for new data
encoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])

In [9]:
pd.concat([df,encoder_df],axis=1)

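|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |
| 5 | blue  | 1.0        | 0.0         | 0.0       |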


In [10]:
import seaborn as sns
sns.load_dataset('tips')

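|     | total_bill | tip  | sex    | smoker | day | time   | size |
|-----|------------|------|--------|--------|-----|--------|------|
| 0   | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1   | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2   | 21.01      | 3.50 | Male   | No     | Sun | Dinner | 3    |
| 3   | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4   | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |
| ... | ...        | ...  | ...    | ...    | ... | ...    | ...  |
| 239 | 29.03      | 5.92 | Male   | No     | Sat | Dinner | 3    |
| 240 | 27.18      | 2.00 | Female | Yes    | Sat | Dinner | 2    |
| 241 | 22.67      | 2.00 | Male   | Yes    | Sat | Dinner | 2    |
| 242 | 17.82      | 1.75 | Male   | No     | Sat | Dinner | 2    |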
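
The tips data contains several nominal columns (`sex`, `smoker`, `day`, `time`) that could be one-hot encoded in one step. The sketch below is an illustration under assumptions: it uses `pandas.get_dummies` rather than `OneHotEncoder`, and it rebuilds three rows by hand (rows 0, 1 and 240 from the output above) so that it runs without downloading the dataset. The names `tips_sample` and `encoded_tips` are hypothetical.

```python
import pandas as pd

# Three rows copied from the tips output above, so the sketch runs offline.
tips_sample = pd.DataFrame({
    "total_bill": [16.99, 10.34, 27.18],
    "sex": ["Female", "Male", "Female"],
    "smoker": ["No", "No", "Yes"],
    "day": ["Sun", "Sun", "Sat"],
    "time": ["Dinner", "Dinner", "Dinner"],
})

categorical_cols = ["sex", "smoker", "day", "time"]

# get_dummies replaces each listed column with one 0/1 column per category.
encoded_tips = pd.get_dummies(tips_sample, columns=categorical_cols, dtype=int)

print(encoded_tips.columns.tolist())
```

Only categories that occur in the sample get a column, so `time` yields a single `time_Dinner` column here. On the full dataset the same call would produce more columns.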


### Label Encoding 
Label encoding and ordinal encoding are two techniques for encoding categorical data as integers.

Label encoding assigns a unique integer to each category in the variable. The integers are identifiers only and carry no real ranking. Labels are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's `LabelEncoder`, which is intended for target labels, sorts the categories and numbers them from 0. For example, a categorical variable "color" with three possible values (red, green, blue) could be label encoded as follows:

1. Red: 1
2. Green: 2
3. Blue: 3

With `LabelEncoder`, the same variable is encoded as blue: 0, green: 1, red: 2, which is what the cells below produce.

In [11]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [12]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [13]:
## LabelEncoder expects a 1-D input, so pass the Series, not a one-column DataFrame
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 1, 2, 0])

In [14]:
lbl_encoder.transform(['red'])


array([2])

In [15]:
lbl_encoder.transform(['blue'])


array([0])

In [16]:
lbl_encoder.transform(['green'])


array([1])
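
Rather than probing the encoder one value at a time, the full mapping can be read from the fitted object. The sketch below assumes the `classes_` attribute and `inverse_transform` method of `LabelEncoder`; the variable names are hypothetical, and a fresh encoder is fitted so that the block is self-contained.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(colors)

# classes_ holds the sorted unique labels; the index of each label is its code.
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}

# inverse_transform maps integer codes back to the original labels.
restored = label_encoder.inverse_transform(codes)
print(list(restored))
```

The restored list matches the original `colors`, which confirms that the encoding is lossless.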

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order, so comparisons such as "less than" remain meaningful after encoding. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
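
The education-level mapping above can be sketched with a plain dictionary and `Series.map` in pandas. This is an illustration under assumptions: the `people` DataFrame and its values are hypothetical sample data, and the 1-based codes follow the list above (scikit-learn's `OrdinalEncoder` starts at 0 instead).

```python
import pandas as pd

# Explicit rank for each level, matching the list above (1 = lowest).
education_order = {
    "high school": 1,
    "college": 2,
    "graduate": 3,
    "post-graduate": 4,
}

# Hypothetical sample data for illustration.
people = pd.DataFrame({
    "education": ["college", "high school", "post-graduate", "graduate", "college"]
})

# map() looks up each value in the dictionary; unknown values would become NaN.
people["education_encoded"] = people["education"].map(education_order)

print(people["education_encoded"].tolist())  # [2, 1, 4, 3, 2]
```

Writing the order out by hand keeps the ranking explicit and avoids relying on alphabetical sorting, which would place "college" before "high school" only by coincidence.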

In [17]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [18]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [19]:
df

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | medium |
| 4 | small  |
| 5 | large  |


In [20]:
## create an instance of OrdinalEncoder with an explicit category order, then fit_transform
encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])

In [21]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [22]:
encoder.transform([['small']])



array([[0.]])

In [23]:
encoder.transform([['large']])



array([[2.]])

In [24]:
encoder.transform([['medium']])



array([[1.]])
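
The encoder above raises an error if it meets a size it was not told about. The sketch below shows one possible way to tolerate unseen values and to decode the result, assuming the `handle_unknown="use_encoded_value"` and `unknown_value` options of `OrdinalEncoder` (available in scikit-learn 0.24 and later). The names `robust_encoder` and `new_sizes`, and the value `extra-large`, are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "medium"]})

# Unknown categories are encoded as -1 instead of raising an error.
robust_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
encoded_sizes = robust_encoder.fit_transform(sizes[["size"]])

# "extra-large" is not in the declared category list.
new_sizes = pd.DataFrame({"size": ["large", "extra-large"]})
encoded_new = robust_encoder.transform(new_sizes[["size"]])
print(encoded_new.ravel().tolist())  # [2.0, -1.0]

# inverse_transform recovers the original categories from the codes.
decoded = robust_encoder.inverse_transform(encoded_sizes)
print(decoded.ravel().tolist())
```

A sentinel such as -1 sits outside the valid range 0 to 2, so downstream code can detect and handle unseen sizes explicitly.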