## Data Encoding
1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which most machine learning algorithms require. Each category is represented as a binary vector with one position per unique category: the position belonging to that category is 1 and every other position is 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be one-hot encoded as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1]
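
Before turning to scikit-learn, the mapping above can be sketched by hand. This is only an illustration: the helper name `one_hot` and the column order (red, green, blue) are assumptions chosen to match the list, not part of any library.

```python
# Column order is an assumption chosen to match the list above.
categories = ["red", "green", "blue"]


def one_hot(value, categories):
    """Return a list with 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


# Build the vector for every category in the example.
vectors = {color: one_hot(color, categories) for color in categories}
print(vectors)
```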

In [1]:
import pandas as pd 
from sklearn.preprocessing import OneHotEncoder

## Creating a dataset 

In [21]:
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})

In [3]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [4]:
encoder = OneHotEncoder()

In [7]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [8]:
# Wrap the encoded array in a DataFrame, naming the columns after the fitted categories
encoded_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())


In [10]:
encoded_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [15]:
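# Encode a new value; passing a DataFrame keeps the feature name the encoder was fitted with
encoder.transform(pd.DataFrame({'color':['blue']})).toarray()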



array([[1., 0., 0.]])
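
Transforming new data raises an error by default if it contains a category the encoder never saw during fitting. A minimal sketch of one way around this, assuming that encoding unseen categories as all zeros is acceptable, uses the `handle_unknown="ignore"` option. The names `train`, `new_data`, and `safe_encoder`, and the value "purple", are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with the three known colors.
train = pd.DataFrame({"color": ["red", "blue", "green"]})

# handle_unknown="ignore" encodes unseen categories as a row of zeros
# instead of raising an error at transform time.
safe_encoder = OneHotEncoder(handle_unknown="ignore")
safe_encoder.fit(train[["color"]])

# "purple" was not present when the encoder was fitted.
new_data = pd.DataFrame({"color": ["blue", "purple"]})
result = safe_encoder.transform(new_data).toarray()
print(result)
```

The columns are ordered blue, green, red, so "blue" maps to the first position and "purple" yields no 1 at all.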

In [16]:
df_new = pd.concat([df,encoded_df],axis=1)
df_new

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0
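
For quick exploration, pandas can produce similar columns in a single call. The sketch below uses `pd.get_dummies` as an alternative to the encoder above; it is not the method used in this notebook, and the frame name `colors` is illustrative. Unlike a fitted `OneHotEncoder`, it does not remember the categories for later reuse on new data.

```python
import pandas as pd

# Same toy data as the color example above.
colors = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})

# columns=["color"] replaces the original column with one indicator column
# per category; dtype=float matches the 0.0/1.0 values shown above.
dummies = pd.get_dummies(colors, columns=["color"], dtype=float)
print(dummies)
```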


# Assignment

In [1]:
import seaborn as sns
dataset = sns.load_dataset('tips')
dataset.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [2]:
df = dataset

In [3]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [11]:
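from sklearn.preprocessing import OneHotEncoder
# drop='first' removes the first category of each feature, so it is encoded as all zeros
# sparse_output=False returns a dense array (this parameter was named `sparse` before scikit-learn 1.2)
ohe = OneHotEncoder(drop='first',sparse_output=False)
encoded_features = ohe.fit_transform(df[['time','sex','day','smoker']])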




In [12]:
encoded_features

array([[0., 0., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       [0., 1., 0., 1., 0., 0.],
       ...,
       [0., 1., 1., 0., 0., 1.],
       [0., 1., 1., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0.]])

In [14]:
import pandas as pd
encoded_df = pd.DataFrame(encoded_features,columns=ohe.get_feature_names_out())

In [15]:
encoded_df.head()

Unnamed: 0,time_Lunch,sex_Male,day_Sat,day_Sun,day_Thur,smoker_Yes
0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,1.0,0.0,1.0,0.0,0.0
2,0.0,1.0,0.0,1.0,0.0,0.0
3,0.0,1.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0


In [18]:
df.drop(columns=['sex','smoker','day','time'],inplace=True)

In [19]:
df.head()

Unnamed: 0,total_bill,tip,size
0,16.99,1.01,2
1,10.34,1.66,3
2,21.01,3.5,3
3,23.68,3.31,2
4,24.59,3.61,4


In [20]:
df = pd.concat([df,encoded_df],axis=1)
df.head()

Unnamed: 0,total_bill,tip,size,time_Lunch,sex_Male,day_Sat,day_Sun,day_Thur,smoker_Yes
0,16.99,1.01,2,0.0,0.0,0.0,1.0,0.0,0.0
1,10.34,1.66,3,0.0,1.0,0.0,1.0,0.0,0.0
2,21.01,3.5,3,0.0,1.0,0.0,1.0,0.0,0.0
3,23.68,3.31,2,0.0,1.0,0.0,1.0,0.0,0.0
4,24.59,3.61,4,0.0,0.0,0.0,1.0,0.0,0.0


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. The integers are usually assigned in alphabetical order or by category frequency; scikit-learn's LabelEncoder, used below, sorts the categories alphabetically and numbers them from 0. For example, a categorical variable "color" with three possible values (red, green, blue) can be label encoded as follows:

1. Blue: 0
2. Green: 1
3. Red: 2
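
Label encoding amounts to a lookup table from category to integer. The sketch below builds that table by hand, assuming alphabetical ordering starting at 0, which is the convention scikit-learn's `LabelEncoder` follows. The names `mapping` and `encoded` are illustrative.

```python
# Same color values used elsewhere in this notebook.
colors = ["red", "blue", "green", "green", "red", "blue"]

# Sort the unique categories and number them from 0.
mapping = {category: index for index, category in enumerate(sorted(set(colors)))}

# Replace every value with its integer label.
encoded = [mapping[color] for color in colors]

print(mapping)  # {'blue': 0, 'green': 1, 'red': 2}
print(encoded)  # [2, 0, 1, 1, 2, 0]
```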

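In [22]:
df = pd.DataFrame({'color':['red','blue','green','green','red','blue']})  # recreate the color dataset; df was reassigned to the tips data above
df.head()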

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [23]:
from sklearn.preprocessing import LabelEncoder

In [24]:
lb = LabelEncoder()

In [25]:
# LabelEncoder expects a 1-D input, so pass the column as a Series rather than a one-column DataFrame
encoded_features = lb.fit_transform(df['color'])


In [26]:
encoded_features

array([2, 0, 1, 1, 2, 0])

In [27]:
lb.classes_

array(['blue', 'green', 'red'], dtype=object)
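
Because the fitted encoder stores its categories in `classes_`, the integer codes can be mapped back to the original labels. A short self-contained sketch, with the illustrative names `label_encoder`, `codes`, and `decoded`:

```python
from sklearn.preprocessing import LabelEncoder

label_encoder = LabelEncoder()

# Fit on the color values and encode them in one step.
codes = label_encoder.fit_transform(["red", "blue", "green", "green", "red", "blue"])

# inverse_transform looks each code up in label_encoder.classes_.
decoded = label_encoder.inverse_transform(codes)

print(codes)    # [2 0 1 1 2 0]
print(decoded)  # ['red' 'blue' 'green' 'green' 'red' 'blue']
```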

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate) can be ordinal encoded as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4
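
The education ranking above can be applied with a plain dictionary lookup. This is a minimal sketch, assuming the 1-based ranks from the list; the frame `people` and its rows are made up for illustration.

```python
import pandas as pd

# Categories listed from lowest to highest rank, as in the list above.
education_order = ["high school", "college", "graduate", "post-graduate"]

# Number the levels from 1 to match the example mapping.
rank = {level: position for position, level in enumerate(education_order, start=1)}

# Hypothetical data to encode.
people = pd.DataFrame({"education": ["college", "high school", "post-graduate", "graduate"]})
people["education_encoded"] = people["education"].map(rank)

print(people)
```

Note that scikit-learn's `OrdinalEncoder` numbers categories from 0 rather than 1, so its output for the same order would be shifted down by one.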

In [28]:
from sklearn.preprocessing import OrdinalEncoder


In [29]:
df = pd.DataFrame({
    'size': ['small', 'medium', 'large', 'medium', 'small', 'large']
})

In [30]:
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [31]:
orend = OrdinalEncoder(categories=[['small','medium','large']])

In [32]:
orend.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
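
Passing `categories` explicitly matters because, without it, `OrdinalEncoder` falls back to sorted (alphabetical) order, which here would rank "large" below "small". The sketch below contrasts the two on the same size values; `sizes`, `default_codes`, and `ordered_codes` are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sizes = pd.DataFrame({"size": ["small", "medium", "large", "medium", "small", "large"]})

# Default: categories are sorted alphabetically (large=0, medium=1, small=2).
default_codes = OrdinalEncoder().fit_transform(sizes[["size"]]).ravel()

# Explicit order: small=0, medium=1, large=2.
ordered_codes = (
    OrdinalEncoder(categories=[["small", "medium", "large"]])
    .fit_transform(sizes[["size"]])
    .ravel()
)

print(default_codes)  # [2. 1. 0. 1. 2. 0.]
print(ordered_codes)  # [0. 1. 2. 1. 0. 2.]
```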