## Data Encoding
We will convert categorical variables into meaningful numerical values so that the model can work with them:

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, is a technique for representing categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector in which each position corresponds to one unique category. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]

### Disadvantages

>If a feature has many categories, OHE is a poor fit, because it creates one new column per category.

>The result is a sparse matrix (mostly 0s with a few 1s), and the extra dimensions can lead to overfitting.
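To make the column-growth concern concrete, here is a minimal sketch that one-hot encodes a made-up high-cardinality column. The `user_id` column and its 1,000 values are hypothetical and used only for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with 1,000 distinct categories (illustration only)
ids = pd.DataFrame({'user_id': [f'user_{i}' for i in range(1000)]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(ids[['user_id']])  # SciPy sparse matrix by default

n_rows, n_cols = encoded.shape
# Fraction of cells that are non-zero: exactly one 1 per row
density = encoded.nnz / (n_rows * n_cols)

print(n_rows, n_cols, density)
```

One input column becomes as many output columns as there are categories, and almost every cell is 0.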

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [2]:
## Create a simple dataframe 
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'green', 'red', 'blue']
})

In [3]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [4]:
## create an instance of OneHotEncoder
encoder=OneHotEncoder()

In [5]:
## perform fit and transform
encoded=encoder.fit_transform(df[['color']]).toarray()

In [6]:
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())

In [7]:
encoder_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [8]:
## for new data, pass a DataFrame with the same column name used in fit
encoder.transform(pd.DataFrame({'color':['blue']})).toarray()



array([[1., 0., 0.]])
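New data can also contain a category the encoder never saw during `fit`. The sketch below, which uses a small made-up training frame and the unseen value `'purple'`, compares the default behaviour with scikit-learn's `handle_unknown='ignore'` option. Whether silently ignoring unknown categories is appropriate depends on the use case.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Small illustrative training frame (not the notebook's df)
train = pd.DataFrame({'color': ['red', 'blue', 'green']})
new = pd.DataFrame({'color': ['purple']})  # category not seen during fit

# Default encoder: raises an error on unknown categories
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    strict_raised = False
except ValueError:
    strict_raised = True

# handle_unknown='ignore': unknown categories become an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
unknown_row = lenient.transform(new).toarray()

print(strict_raised, unknown_row)
```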

In [9]:
pd.concat([df,encoder_df],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


In [44]:
import seaborn as sns
df=sns.load_dataset('tips')
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [17]:
# Now we will perform OHE on the day feature
df['day'].unique()

# day has 4 categories (Thur, Fri, Sat, Sun),
# so we create a new encoder instance for it
encoder=OneHotEncoder()

encoded=encoder.fit_transform(df[['day']]).toarray()
encoder_df=pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
encoder_df

pd.concat([df['day'],encoder_df],axis=1)

Unnamed: 0,day,day_Fri,day_Sat,day_Sun,day_Thur
0,Sun,0.0,0.0,1.0,0.0
1,Sun,0.0,0.0,1.0,0.0
2,Sun,0.0,0.0,1.0,0.0
3,Sun,0.0,0.0,1.0,0.0
4,Sun,0.0,0.0,1.0,0.0
...,...,...,...,...,...
239,Sat,0.0,1.0,0.0,0.0
240,Sat,0.0,1.0,0.0,0.0
241,Sat,0.0,1.0,0.0,0.0
242,Sat,0.0,1.0,0.0,0.0
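For quick exploration, a similar table can be produced with pandas alone. This is a sketch using `pd.get_dummies` on a small hand-made `day` column (not the full tips dataset). Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember the categories, so it is less suited to transforming new data later.

```python
import pandas as pd

# Hand-made sample of the 'day' feature (illustration only)
days = pd.DataFrame({'day': ['Sun', 'Sat', 'Thur', 'Fri', 'Sat']})

# One column per category, prefixed with 'day_', stored as floats
dummies = pd.get_dummies(days['day'], prefix='day', dtype=float)

print(pd.concat([days, dummies], axis=1))
```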


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer label to each category in the variable. The labels are usually assigned in alphabetical order or based on the frequency of the categories; scikit-learn's `LabelEncoder` sorts the categories, so string labels end up in alphabetical order. For example, the "day" variable of the tips dataset has four possible values (Fri, Sat, Sun, Thur), which label encoding represents as follows:

1. Friday: 0
2. Saturday: 1
3. Sunday: 2
4. Thursday: 3

### Disadvantages
Because the labels are integers from 0 to n-1, a machine learning model may treat them as magnitudes, as if one category were greater than another, which can distort the results.
> It imposes a ranking, and therefore comparisons, on nominal values that have no natural order.

In [45]:
df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [39]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder=LabelEncoder()

In [48]:
# LabelEncoder expects 1D input, so pass the Series df['day'] (not df[['day']])
df_1=pd.DataFrame(lbl_encoder.fit_transform(df['day']))


In [49]:
pd.concat([df['day'],df_1],axis=1)

Unnamed: 0,day,0
0,Sun,2
1,Sun,2
2,Sun,2
3,Sun,2
4,Sun,2
...,...,...
239,Sat,1
240,Sat,1
241,Sat,1
242,Sat,1


In [50]:
lbl_encoder.transform(['Sat'])


array([1])

In [26]:
lbl_encoder.transform(['Sun'])


array([2])

In [37]:
lbl_encoder.transform(['Fri'])


array([0])
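Rather than probing one value at a time, the full label mapping can be read from the fitted encoder. The sketch below uses a small hand-made list of days instead of the tips dataset; `classes_` and `inverse_transform` are part of scikit-learn's `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Hand-made sample of day values (illustration only)
days = ['Sun', 'Sat', 'Thur', 'Fri', 'Sat']

le = LabelEncoder()
codes = le.fit_transform(days)

# classes_ holds the sorted categories; the index of each one is its label
mapping = {str(label): code for code, label in enumerate(le.classes_)}

# inverse_transform maps the integer labels back to the original strings
decoded = le.inverse_transform(codes)

print(mapping)
print(list(codes), list(decoded))
```

The mapping also shows the drawback described above: 'Thur' receives a larger number than 'Fri' purely because of alphabetical sorting.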

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
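The 1-to-4 ranking above can be reproduced with a plain dictionary and pandas `map`. This is a minimal sketch: the `people` frame and the `education_rank` column name are made up for illustration.

```python
import pandas as pd

# Explicit rank for each level, matching the list above
education_order = {
    'high school': 1,
    'college': 2,
    'graduate': 3,
    'post-graduate': 4,
}

# Hypothetical sample data (illustration only)
people = pd.DataFrame({
    'education': ['college', 'high school', 'post-graduate', 'graduate']
})

# Replace each category with its rank
people['education_rank'] = people['education'].map(education_order)

print(people)
```

Note that `map` yields `NaN` for any value missing from the dictionary, so unexpected categories are easy to spot.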

In [51]:
## Ordinal Encoding
from sklearn.preprocessing import OrdinalEncoder

In [52]:
# create a sample dataframe with an ordinal variable
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [55]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [60]:
## create an instance of OrdinalEncoder and then fit_transform
encoder=OrdinalEncoder(categories=[['small','medium','large']])
# the order in which values are listed in `categories` sets the ranks, from lowest to highest
# therefore
# small=0
# medium=1
# large=2

In [61]:
encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])

In [62]:
encoder.transform(pd.DataFrame({'size':['small']}))



array([[0.]])
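Passing `categories` explicitly matters because, without it, `OrdinalEncoder` sorts the categories itself, which for strings means alphabetical order. The sketch below compares the two on a three-row frame; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({'size': ['small', 'medium', 'large']})

# Default: categories are sorted alphabetically (large, medium, small)
default_codes = OrdinalEncoder().fit_transform(sizes).ravel().tolist()

# Explicit order: small < medium < large
ordered = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered.fit_transform(sizes).ravel().tolist()

print(default_codes)  # small, medium, large -> 2.0, 1.0, 0.0
print(ordered_codes)  # small, medium, large -> 0.0, 1.0, 2.0
```

With the default, 'small' gets the highest code, which reverses the intended ranking.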