#### Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

Data encoding means converting categorical (non-numeric) data into a numeric form so that machine learning models can process it.

##### Nominal Encoding

Nominal encoding applies to categorical variables with no inherent order (like color: red, blue, green).
Two common approaches:

1. Label Encoding:
Each category gets a number.
Example:
```
Red   → 0
Blue  → 1
Green → 2
```


⚠️ Not ideal for nominal data: a model may treat the codes as ordered (2 > 1 > 0), even though the categories have no ranking.

2. One-Hot Encoding (OHE):
Creates a binary column for each category.

Example:

| Color_Red | Color_Blue | Color_Green |
|-----------|------------|-------------|
|     1     |      0     |      0      |

✅ Best for nominal data since it removes any implied order.

In short:
* Encoding = turning categories into numbers.
* Nominal data → use One-Hot Encoding (not Label Encoding).
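To make the ordering concern concrete, the sketch below compares how far apart the colors appear under each encoding. It uses only the standard library; the helper name `euclidean` and the variable names are illustrative assumptions, not part of the notebook.

```python
import math

# Label-encoded colors (same codes as the label encoding example above)
label_codes = {"red": 0, "blue": 1, "green": 2}

# One-hot vectors for the same colors
one_hot = {
    "red":   (1, 0, 0),
    "blue":  (0, 1, 0),
    "green": (0, 0, 1),
}

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# With label codes, red looks twice as far from green as from blue
label_red_blue = abs(label_codes["red"] - label_codes["blue"])
label_red_green = abs(label_codes["red"] - label_codes["green"])

# With one-hot vectors, every pair of distinct colors is equally far apart
ohe_red_blue = euclidean(one_hot["red"], one_hot["blue"])
ohe_red_green = euclidean(one_hot["red"], one_hot["green"])

print(label_red_blue, label_red_green)
print(ohe_red_blue == ohe_red_green)
```

Distance-based and linear models are the ones most likely to be misled by the unequal gaps that label codes introduce.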

In [102]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [103]:
# creating a simple dataframe
df = pd.DataFrame({
    'color':['red','blue','green','green','red']
})

In [104]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [105]:
## create an instance of onehotencoder
encoder=OneHotEncoder()

In [106]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [107]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [108]:
encoder_df

|   | color_blue | color_green | color_red |
|---|------------|-------------|-----------|
| 0 | 0.0        | 0.0         | 1.0       |
| 1 | 1.0        | 0.0         | 0.0       |
| 2 | 0.0        | 1.0         | 0.0       |
| 3 | 0.0        | 1.0         | 0.0       |
| 4 | 0.0        | 0.0         | 1.0       |


In [109]:
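## for new data: pass a DataFrame with the same column name used during fit,
## which avoids scikit-learn's feature-name warning
encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()
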
array([[1., 0., 0.]])

In [110]:
pd.concat([df,encoder_df],axis=1)

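|   | color | color_blue | color_green | color_red |
|---|-------|------------|-------------|-----------|
| 0 | red   | 0.0        | 0.0         | 1.0       |
| 1 | blue  | 1.0        | 0.0         | 0.0       |
| 2 | green | 0.0        | 1.0         | 0.0       |
| 3 | green | 0.0        | 1.0         | 0.0       |
| 4 | red   | 0.0        | 0.0         | 1.0       |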


In [111]:
import seaborn as sns
df1=sns.load_dataset('tips')

In [112]:
df1.head()

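|   | total_bill | tip  | sex    | smoker | day | time   | size |
|---|------------|------|--------|--------|-----|--------|------|
| 0 | 16.99      | 1.01 | Female | No     | Sun | Dinner | 2    |
| 1 | 10.34      | 1.66 | Male   | No     | Sun | Dinner | 3    |
| 2 | 21.01      | 3.5  | Male   | No     | Sun | Dinner | 3    |
| 3 | 23.68      | 3.31 | Male   | No     | Sun | Dinner | 2    |
| 4 | 24.59      | 3.61 | Female | No     | Sun | Dinner | 4    |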


In [113]:
# sex, smoker, day, and time are categorical columns,
# so we convert them with one-hot encoding

df1['sex'].value_counts()
df1['smoker'].value_counts()
df1['day'].value_counts()
df1['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [114]:
encoder=OneHotEncoder()

In [115]:
## to perform fit and transform
encoded_data=encoder.fit_transform(df1[['sex','smoker','day','time']]).toarray()

In [116]:
encoder_df=pd.DataFrame(encoded_data,columns=encoder.get_feature_names_out())

In [117]:
encoder_df

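|     | sex_Female | sex_Male | smoker_No | smoker_Yes | day_Fri | day_Sat | day_Sun | day_Thur | time_Dinner | time_Lunch |
|-----|------------|----------|-----------|------------|---------|---------|---------|----------|-------------|------------|
| 0   | 1.0        | 0.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 1   | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 2   | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 3   | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| 4   | 1.0        | 0.0      | 1.0       | 0.0        | 0.0     | 0.0     | 1.0     | 0.0      | 1.0         | 0.0        |
| ... | ...        | ...      | ...       | ...        | ...     | ...     | ...     | ...      | ...         | ...        |
| 239 | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 1.0     | 0.0     | 0.0      | 1.0         | 0.0        |
| 240 | 1.0        | 0.0      | 0.0       | 1.0        | 0.0     | 1.0     | 0.0     | 0.0      | 1.0         | 0.0        |
| 241 | 0.0        | 1.0      | 0.0       | 1.0        | 0.0     | 1.0     | 0.0     | 0.0      | 1.0         | 0.0        |
| 242 | 0.0        | 1.0      | 1.0       | 0.0        | 0.0     | 1.0     | 0.0     | 0.0      | 1.0         | 0.0        |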


## Label Encoding

In [118]:
df.head()

|   | color |
|---|-------|
| 0 | red   |
| 1 | blue  |
| 2 | green |
| 3 | green |
| 4 | red   |


In [120]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [121]:
# LabelEncoder expects a 1-D input, so pass the Series (df['color'])
# rather than a one-column DataFrame (df[['color']])
lbl_encoder.fit_transform(df['color'])


array([2, 0, 1, 1, 2])

In [122]:
lbl_encoder.transform(['red'])


array([2])

In [124]:
lbl_encoder.transform(['blue'])


array([0])

In [125]:
lbl_encoder.transform(['green'])


array([1])
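The codes above (blue = 0, green = 1, red = 2) follow alphabetical order because `LabelEncoder` sorts the categories it finds. The sketch below shows how to inspect that mapping through `classes_` and how to recover the original labels with `inverse_transform`. The variable names `colors`, `codes`, `mapping`, and `decoded` are illustrative. Note also that scikit-learn documents `LabelEncoder` as intended for target values (`y`) rather than input features.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red"]

lbl_encoder = LabelEncoder()
codes = lbl_encoder.fit_transform(colors)

# classes_ lists the categories in sorted order; each index is that category's code
mapping = {label: code for code, label in enumerate(lbl_encoder.classes_)}
print(mapping)

# inverse_transform maps integer codes back to the original labels
decoded = lbl_encoder.inverse_transform(codes)
print(list(decoded))
```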

### Ordinal Encoding

Ordinal encoding is used in feature engineering to convert categorical variables with an inherent order (e.g., “low,” “medium,” “high”) into numerical values that machine learning models can understand.

In short:

* It assigns each category an integer based on its position in the order (e.g., low = 1, medium = 2, high = 3).
* This preserves the ranking relationship between categories.
* It’s especially useful for models that can interpret order (like linear regression or tree-based models).

✅ Use ordinal encoding when: the categories have a natural order.

❌ Avoid it when: the categories are nominal (no order, like colors or cities).

For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
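The mapping above can be sketched with a plain dictionary and pandas `Series.map`, which keeps the 1-based ranks exactly as listed. The sample rows and the names `edu` and `education_rank` are hypothetical, made up for this illustration.

```python
import pandas as pd

# Hypothetical sample data for illustration
edu = pd.DataFrame({
    "education_level": ["college", "high school", "post-graduate", "graduate", "college"]
})

# Explicit rank for each category, lowest to highest
education_rank = {
    "high school": 1,
    "college": 2,
    "graduate": 3,
    "post-graduate": 4,
}

# Map each category to its rank
edu["education_encoded"] = edu["education_level"].map(education_rank)
print(edu)
```

scikit-learn's `OrdinalEncoder` does the same job but numbers the categories from 0 rather than 1.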

In [127]:
# Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder

In [128]:
## creating a simple dataframe with an ordinal variable

df=pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})

In [129]:
df

|   | size   |
|---|--------|
| 0 | small  |
| 1 | medium |
| 2 | large  |
| 3 | medium |
| 4 | small  |
| 5 | large  |


In [130]:
## assign ranks by listing the categories from lowest to highest
## create an instance of OrdinalEncoder and then fit_transform
encoder=OrdinalEncoder(categories=[['small','medium','large']])

In [131]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [132]:
encoder.transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
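With an explicit `categories` list, a value outside that list raises an error at transform time. The sketch below shows one possible way to handle this, using the `handle_unknown="use_encoded_value"` and `unknown_value` options of `OrdinalEncoder`. The sentinel `-1`, the unseen value `"xlarge"`, and the names `sizes`, `ord_encoder`, and `new_sizes` are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small", "large"]})

# Unknown categories are encoded as -1 instead of raising an error
ord_encoder = OrdinalEncoder(
    categories=[["small", "medium", "large"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
ord_encoder.fit(sizes[["size"]])

# "xlarge" was not in the categories list given to the encoder
new_sizes = pd.DataFrame({"size": ["large", "xlarge", "small"]})
encoded_sizes = ord_encoder.transform(new_sizes)
print(encoded_sizes.ravel())
```

A sentinel such as `-1` sits below every real rank, so a model may read it as "smaller than small". Treat that as a deliberate modeling decision, not a neutral default.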