## Data encoding in machine learning
Data encoding is the process of converting categorical data into numerical formats that algorithms can understand.

## Types of Data Encoding Techniques

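### Nominal encoding
Nominal encoding is the process of converting nominal (categorical) data without any intrinsic order into numerical representations suitable for machine learning models. The most common method is one-hot encoding, which this section demonstrates with scikit-learn's `OneHotEncoder`.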

In [4]:
# To explain nominal encoding, let's use the Iris dataset
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

In [5]:
iris = load_iris()
iris.keys()

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [6]:

df_iris = pd.DataFrame(iris.data, columns = iris.feature_names)

In [7]:
df_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [8]:
iris_col = np.array(iris.feature_names)

In [9]:
iris_col

array(['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)',
       'petal width (cm)'], dtype='<U17')

In [10]:
df_iris_col = pd.DataFrame({"Columns":iris_col})

In [11]:
df_iris_col

Unnamed: 0,Columns
0,sepal length (cm)
1,sepal width (cm)
2,petal length (cm)
3,petal width (cm)


### Here we can see that the column names listed above have no inherent order
### So we use nominal (one-hot) encoding here

In [13]:
from sklearn.preprocessing import OneHotEncoder

In [14]:
N_encoder = OneHotEncoder()

In [15]:
N_encoder

In [16]:
df_iris_col2 = N_encoder.fit_transform(df_iris_col[["Columns"]]).toarray()

In [17]:
df_iris_col2

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

In [18]:
encoded_df = pd.DataFrame(df_iris_col2, columns = N_encoder.get_feature_names_out())

In [19]:
print(encoded_df)

   Columns_petal length (cm)  Columns_petal width (cm)  \
0                        0.0                       0.0   
1                        0.0                       0.0   
2                        1.0                       0.0   
3                        0.0                       1.0   

   Columns_sepal length (cm)  Columns_sepal width (cm)  
0                        1.0                       0.0  
1                        0.0                       1.0  
2                        0.0                       0.0  
3                        0.0                       0.0  



## Interpretation 
- Each distinct value in the "Columns" column of df_iris_col becomes its own column in the encoded table, so the categories are spread across separate columns.
- A value of 1 marks the category the row belongs to (true); a value of 0 marks every other category (false).
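
To make the interpretation concrete, the following minimal sketch rebuilds the same one-hot matrix and reads it back. It assumes the same four Iris feature names as above; the variable names (`names`, `row_sums`, `column_order`, `recovered`) are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Same four feature names used in the notebook above
names = np.array([
    "sepal length (cm)", "sepal width (cm)",
    "petal length (cm)", "petal width (cm)",
]).reshape(-1, 1)  # OneHotEncoder expects a 2D input

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(names).toarray()

# Each row contains exactly one 1; its position identifies the category
row_sums = one_hot.sum(axis=1)

# categories_ holds the sorted category order used for the output columns
column_order = list(encoder.categories_[0])

# inverse_transform maps each one-hot row back to its original label
recovered = encoder.inverse_transform(one_hot).ravel().tolist()

print(column_order)
print(recovered)
```

The encoder sorts the categories, which is why the columns of the encoded table appear in alphabetical order rather than in the order of the rows.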

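## Note
- If a feature has n categories, only n-1 one-hot columns are needed: when all n-1 columns are 0, the row must belong to the remaining category, so the n-th column carries no extra information.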

In [22]:
df_iris_col2

array([[0., 0., 1., 0.],
       [0., 0., 0., 1.],
       [1., 0., 0., 0.],
       [0., 1., 0., 0.]])

In [23]:
df_iris_col

Unnamed: 0,Columns
0,sepal length (cm)
1,sepal width (cm)
2,petal length (cm)
3,petal width (cm)


In [24]:
encoded_df

Unnamed: 0,Columns_petal length (cm),Columns_petal width (cm),Columns_sepal length (cm),Columns_sepal width (cm)
0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0
3,0.0,1.0,0.0,0.0


In [25]:
encoded_df["Column-name"] = df_iris_col['Columns']

In [26]:
encoded_df

Unnamed: 0,Columns_petal length (cm),Columns_petal width (cm),Columns_sepal length (cm),Columns_sepal width (cm),Column-name
0,0.0,0.0,1.0,0.0,sepal length (cm)
1,0.0,0.0,0.0,1.0,sepal width (cm)
2,1.0,0.0,0.0,0.0,petal length (cm)
3,0.0,1.0,0.0,0.0,petal width (cm)



## Label Encoding
Label Encoding is a technique in machine learning used to convert categorical data into numerical form by assigning a unique integer to each category.

In [29]:
from sklearn.preprocessing import LabelEncoder

In [30]:
L_encoder = LabelEncoder()
L_encoder

In [31]:
L_encoded_df = L_encoder.fit_transform(df_iris_col['Columns'])
L_encoded_df = pd.DataFrame(L_encoded_df)
L_encoded_df["Column_name"] = df_iris_col["Columns"]
L_encoded_df

Unnamed: 0,0,Column_name
0,2,sepal length (cm)
1,3,sepal width (cm)
2,0,petal length (cm)
3,1,petal width (cm)


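- Advantages: Simple, and it produces a single integer column however many categories there are. It can suit ordinal data, but note that scikit-learn's `LabelEncoder` assigns integers in sorted (alphabetical) order, as the output above shows, so the integers reflect the real order only if it happens to match.
- Disadvantages: For nominal data (no inherent order), it may introduce a sense of hierarchy, which can mislead certain algorithms.

### Note: 
For non-ordinal data, consider using One-Hot Encoding instead.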
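
The sketch below shows how to inspect which integer `LabelEncoder` assigned to each category and how to convert the integers back. It reuses the four Iris feature names; `names`, `codes`, `mapping`, and `restored` are illustrative names, not from the original notebook.

```python
from sklearn.preprocessing import LabelEncoder

names = [
    "sepal length (cm)", "sepal width (cm)",
    "petal length (cm)", "petal width (cm)",
]

label_encoder = LabelEncoder()
codes = label_encoder.fit_transform(names)

# classes_ lists the categories in sorted order; the position of each
# category in classes_ is the integer it was assigned
mapping = {str(label): code for code, label in enumerate(label_encoder.classes_)}
print(mapping)

# inverse_transform converts the integers back to the original labels
restored = label_encoder.inverse_transform(codes).tolist()
print(restored)
```

Because the integers come from alphabetical sorting, "petal length (cm)" receives 0 even though it is the third row of the input.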


## Ordinal Encoding
Ordinal Encoding is a technique used to convert categorical data into numerical values while preserving the order or hierarchy among the categories.

In [35]:
from sklearn.preprocessing import OrdinalEncoder

In [36]:
O_encoder = OrdinalEncoder(categories=[["sepal length (cm)","sepal width (cm)","petal length (cm)","petal width (cm)"]])

In [37]:
O_encoder

In [38]:
O_encoded_df = O_encoder.fit_transform(df_iris_col[['Columns']])
O_encoded_df = pd.DataFrame(O_encoded_df)

In [39]:
O_encoded_df["Column_name"] = df_iris_col['Columns']
O_encoded_df

Unnamed: 0,0,Column_name
0,0.0,sepal length (cm)
1,1.0,sepal width (cm)
2,2.0,petal length (cm)
3,3.0,petal width (cm)


- Advantages: Maintains the order of categories, which is critical for ordinal data.
- Disadvantages: Assumes equal spacing between categories, which might not always be true.

Use Ordinal Encoding only when the categories have a meaningful sequence.
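
The Iris column names have no real ranking, so here is a small sketch with a feature that does. The `size` column and its values are hypothetical data invented for this illustration; only `OrdinalEncoder` and its `categories` parameter come from the example above.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: a feature with a genuine low < medium < high order
sizes = pd.DataFrame({"size": ["medium", "low", "high", "low"]})

# The categories list states the order explicitly: low -> 0, medium -> 1, high -> 2
size_encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])

# fit_transform returns a 2D float array; ravel() flattens it to one column
sizes["size_encoded"] = size_encoder.fit_transform(sizes[["size"]]).ravel()
print(sizes)
```

Passing `categories` explicitly is what preserves the intended order; without it, the encoder would fall back to sorted order, as `LabelEncoder` does.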

## Target Guided Ordinal Encoding 
Target Guided Ordinal Encoding is a technique that assigns ordinal values to categories based on the relationship between each category and the target variable. It is typically used in supervised learning tasks.

In [42]:
df_iris

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2
...,...,...,...,...
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3


In [43]:
Sum_value = np.array(df_iris.sum())

In [44]:
df_iris_col["Sums"] = Sum_value 

In [45]:
# Truncate the float sums to integers for a compact display
df_iris_col["Sums"] = df_iris_col["Sums"].astype('int64')
df_iris_col

Unnamed: 0,Columns,Sums
0,sepal length (cm),876
1,sepal width (cm),458
2,petal length (cm),563
3,petal width (cm),179


In [109]:
# Sort the rows by the "Sums" column in descending order
sorted_df = df_iris_col.sort_values(by="Sums", ascending=False)

sorted_df

Unnamed: 0,Columns,Sums
0,sepal length (cm),876
2,petal length (cm),563
1,sepal width (cm),458
3,petal width (cm),179


In [123]:
# Reset the index after sorting
sorted_df = sorted_df.reset_index(drop=True)

# Create the 'Encodes' column
sorted_df["Encodes"] = sorted_df.index

sorted_df

Unnamed: 0,Columns,Sums,Encodes
0,sepal length (cm),876,0
1,petal length (cm),563,1
2,sepal width (cm),458,2
3,petal width (cm),179,3


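### Note: 
Ranking the categories by a summary statistic, as done here with the column sums, can be used as an alternative to Target Guided Ordinal Encoding. It does not use a target variable, so it is not target-guided in the strict sense.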
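
For comparison, the usual form of target-guided ordinal encoding can be sketched as follows: compute the mean of the target for each category, rank the categories by that mean, and map each category to its rank. The `city` and `price` columns are hypothetical data invented for this sketch, and ranking in ascending order is an arbitrary choice (the example above ranks in descending order).

```python
import pandas as pd

# Hypothetical data: a categorical feature and a numeric target
data = pd.DataFrame({
    "city": ["A", "B", "A", "C", "B", "C"],
    "price": [100, 300, 120, 200, 320, 220],
})

# 1. Mean of the target for each category
mean_price = data.groupby("city")["price"].mean()

# 2. Rank the categories by that mean (lowest mean gets 0)
ordered = mean_price.sort_values().index
city_to_rank = {city: rank for rank, city in enumerate(ordered)}

# 3. Replace each category with its rank
data["city_encoded"] = data["city"].map(city_to_rank)
print(data)
```

In a real supervised task the mapping should be learned from the training split only and then applied to the test split, so that target information from the test rows does not leak into the encoding.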