## Data Encoding

1. Nominal/OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding 

### Nominal/OHE Encoding
One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms. Each category is represented as a binary vector with one position per unique category: the position for that category is 1 and every other position is 0. For example, if we have a categorical variable "color" with three possible values (red, green, blue), we can represent it using one-hot encoding as follows:

1. Red: [1, 0, 0]
2. Green: [0, 1, 0]
3. Blue: [0, 0, 1] 
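
To make the mapping concrete, the vectors listed above can be built by hand. This is only a minimal sketch with NumPy, assuming the category order red, green, blue; the helper name `one_hot` is hypothetical and not part of any library.

```python
import numpy as np

# Assumed fixed category order; each category gets one position in the vector.
categories = ["red", "green", "blue"]
position = {category: i for i, category in enumerate(categories)}


def one_hot(value):
    """Return a binary vector with a 1 at the position of `value` (hypothetical helper)."""
    vector = np.zeros(len(categories), dtype=int)
    vector[position[value]] = 1
    return vector


vectors = {category: one_hot(category).tolist() for category in categories}
print(vectors)
```

In practice an encoder such as scikit-learn's OneHotEncoder does this bookkeeping and also remembers the category order for new data.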

In [2]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

In [3]:
onehotencoder = OneHotEncoder()
df = pd.DataFrame({
    'color':['red','blue','green','green','red','blue']
})
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [4]:
# fit the encoder and transform the column; .toarray() converts the sparse result to a dense array
onehotencoded = onehotencoder.fit_transform(df[['color']]).toarray()

In [5]:
encoded_df = pd.DataFrame(onehotencoded,columns=onehotencoder.get_feature_names_out())

In [6]:
encoded_df

Unnamed: 0,color_blue,color_green,color_red
0,0.0,0.0,1.0
1,1.0,0.0,0.0
2,0.0,1.0,0.0
3,0.0,1.0,0.0
4,0.0,0.0,1.0
5,1.0,0.0,0.0


In [7]:
#one hot encoding new data
onehotencoder.transform([['blue']]).toarray()



array([[1., 0., 0.]])
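
New data can contain a category the encoder never saw during fitting, and with the default settings `transform` raises an error in that case. The sketch below shows one possible way to handle this with the `handle_unknown="ignore"` option; the value "purple" and the variable names are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same colors as the example above, kept in a separate (hypothetical) variable.
train_colors = pd.DataFrame({"color": ["red", "blue", "green", "green", "red", "blue"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train_colors[["color"]])

# "purple" was not present when the encoder was fitted.
new_colors = pd.DataFrame({"color": ["blue", "purple"]})
encoded_new = tolerant_encoder.transform(new_colors).toarray()
print(encoded_new)
```

The all-zero row means "none of the known categories", which may or may not be what a given model should see, so this is a design choice rather than a default to copy blindly.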

In [8]:
pd.concat([df,encoded_df],axis=1)

Unnamed: 0,color,color_blue,color_green,color_red
0,red,0.0,0.0,1.0
1,blue,1.0,0.0,0.0
2,green,0.0,1.0,0.0
3,green,0.0,1.0,0.0
4,red,0.0,0.0,1.0
5,blue,1.0,0.0,0.0


In [9]:
import seaborn as sns
tips_df = sns.load_dataset('tips')
tips_df.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [10]:
# one encoder per categorical column; .toarray() turns each sparse result into a dense array
sex_encoder = OneHotEncoder()
smoker_encoder = OneHotEncoder()
day_encoder = OneHotEncoder()
time_encoder = OneHotEncoder()
sex_encoded = sex_encoder.fit_transform(tips_df[['sex']]).toarray()
smoker_encoded = smoker_encoder.fit_transform(tips_df[['smoker']]).toarray()
day_encoded = day_encoder.fit_transform(tips_df[['day']]).toarray()
time_encoded = time_encoder.fit_transform(tips_df[['time']]).toarray()


In [11]:
time_encoded

array([[1., 0.],
       [1., 0.],
       [1., 0.],
       ...

In [12]:
sex_encoded_df = pd.DataFrame(sex_encoded,columns=sex_encoder.get_feature_names_out())
smoker_encoded_df = pd.DataFrame(smoker_encoded,columns=smoker_encoder.get_feature_names_out())
day_encoded_df = pd.DataFrame(day_encoded,columns=day_encoder.get_feature_names_out())
time_encoded_df = pd.DataFrame(time_encoded,columns=time_encoder.get_feature_names_out())
time_encoded_df.head()

Unnamed: 0,time_Dinner,time_Lunch
0,1.0,0.0
1,1.0,0.0
2,1.0,0.0
3,1.0,0.0
4,1.0,0.0


In [13]:
onehotencoded_tips_df = pd.concat([tips_df['total_bill'],tips_df['tip'],sex_encoded_df,smoker_encoded_df,day_encoded_df,time_encoded_df,tips_df['size']],axis=1)
onehotencoded_tips_df.head()


Unnamed: 0,total_bill,tip,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch,size
0,16.99,1.01,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2
1,10.34,1.66,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,3
2,21.01,3.5,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,3
3,23.68,3.31,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,2
4,24.59,3.61,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,4


### Label Encoding 
Label encoding and ordinal encoding are two techniques used to encode categorical data as numerical data.

Label encoding assigns a unique integer to each category in the variable. How the integers are chosen depends on the tool: they may follow alphabetical order or the frequency of the categories, and scikit-learn's LabelEncoder sorts the categories and numbers them from 0. For example, if we have a categorical variable "color" with three possible values (red, green, blue), one possible label encoding is:

1. Red: 1
2. Green: 2
3. Blue: 3

In label encoding, the integers do not express a rank: the variable is nominal, so the numeric order of the labels carries no meaning.

In [14]:
df.head()

Unnamed: 0,color
0,red
1,blue
2,green
3,green
4,red


In [15]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [16]:
# LabelEncoder expects a 1-D array, so pass the column as a Series rather than a one-column DataFrame
lbl_encoded = lbl_encoder.fit_transform(df['color'])


In [17]:
lbl_encoded

array([2, 0, 1, 1, 2, 0])

In [18]:
lbl_encoder.transform(['red'])

array([2])

In [19]:
lbl_encoder.transform(['blue'])

array([0])

In [20]:
lbl_encoder.transform(['green'])

array([1])
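
The codes above (blue 0, green 1, red 2) follow the sorted order of the categories. A short sketch, with illustrative variable names, shows how to inspect that order through `classes_` and how to map codes back to the original labels with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["red", "blue", "green", "green", "red", "blue"]

color_encoder = LabelEncoder()
codes = color_encoder.fit_transform(colors)

# classes_ holds the categories in sorted order; the index of a category is its code.
classes = list(color_encoder.classes_)

# inverse_transform maps integer codes back to the original labels.
decoded = list(color_encoder.inverse_transform(codes))

print(classes)
print(codes)
print(decoded)
```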

### Ordinal Encoding
Ordinal encoding is used for categorical data that has an intrinsic order or ranking. Each category is assigned a numerical value based on its position in that order. For example, if we have a categorical variable "education level" with four possible values (high school, college, graduate, post-graduate), we can represent it using ordinal encoding as follows:

1. High school: 1
2. College: 2
3. Graduate: 3
4. Post-graduate: 4

In [21]:
from sklearn.preprocessing import OrdinalEncoder

In [22]:
df = pd.DataFrame({
    'size':['small','medium','large','medium','small','large']
})
df

Unnamed: 0,size
0,small
1,medium
2,large
3,medium
4,small
5,large


In [23]:
ordinalencoder = OrdinalEncoder(categories=[['small','medium','large']])
ordinalencoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [1.],
       [0.],
       [2.]])
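
The same approach applies to the education-level example from the text. The sketch below is an illustration only: the rows and the unseen value "doctorate" are made up, and `handle_unknown="use_encoded_value"` with `unknown_value=-1` is one possible way to deal with categories that were not listed. Note that scikit-learn starts numbering at 0 rather than 1.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# The order of this list defines the encoding: high school -> 0, ..., post-graduate -> 3.
education_levels = ["high school", "college", "graduate", "post-graduate"]

education_encoder = OrdinalEncoder(
    categories=[education_levels],
    handle_unknown="use_encoded_value",  # do not raise on categories outside the list
    unknown_value=-1,                    # code assigned to such categories
)

# Hypothetical training rows.
education_df = pd.DataFrame(
    {"education": ["college", "high school", "post-graduate", "graduate"]}
)
education_encoder.fit(education_df[["education"]])

# "doctorate" is not in the declared order, so it receives -1.
new_education = pd.DataFrame({"education": ["graduate", "doctorate"]})
education_codes = education_encoder.transform(new_education).ravel().tolist()
print(education_codes)
```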

### Target Guided Ordinal Encoding
Target guided ordinal encoding encodes a categorical variable based on its relationship with the target variable. It is useful when a categorical variable has a large number of unique categories and we want to use it as a feature in a machine learning model.

In target guided ordinal encoding, we replace each category with a numerical value derived from the mean or median of the target variable for that category. The encoded values then increase monotonically with the per-category target statistic, which can improve the predictive power of the model.

In [24]:

# create a sample dataframe with a categorical variable and a target variable
df = pd.DataFrame({
    'city': ['New York', 'London', 'Paris', 'Tokyo', 'New York', 'Paris'],
    'price': [200, 150, 300, 250, 180, 320]
})
df

Unnamed: 0,city,price
0,New York,200
1,London,150
2,Paris,300
3,Tokyo,250
4,New York,180
5,Paris,320


In [25]:
# mean price per city, as a dict mapping city -> mean
mean_price = df.groupby('city')['price'].mean().to_dict()

In [26]:
df['city_encoded']=df['city'].map(mean_price)

In [27]:
df[['price','city_encoded']]

Unnamed: 0,price,city_encoded
0,200,190.0
1,150,150.0
2,300,310.0
3,250,250.0
4,180,190.0
5,320,310.0


In [28]:
df

Unnamed: 0,city,price,city_encoded
0,New York,200,190.0
1,London,150,150.0
2,Paris,300,310.0
3,Tokyo,250,250.0
4,New York,180,190.0
5,Paris,320,310.0
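
The cells above substitute the raw mean price for each city. A common variant, sketched here under the assumption that only the ordering of the means should matter, replaces each city with its rank after sorting the cities by mean price. The column name `city_rank` is hypothetical.

```python
import pandas as pd

cities_df = pd.DataFrame({
    "city": ["New York", "London", "Paris", "Tokyo", "New York", "Paris"],
    "price": [200, 150, 300, 250, 180, 320],
})

# Mean target value per category.
mean_by_city = cities_df.groupby("city")["price"].mean()

# Sort categories by their mean and assign ranks starting at 1,
# so the city with the lowest mean price gets rank 1.
ordered_cities = mean_by_city.sort_values().index
city_rank = {city: rank for rank, city in enumerate(ordered_cities, start=1)}

cities_df["city_rank"] = cities_df["city"].map(city_rank)
print(city_rank)
print(cities_df)
```

Ranks keep the ordering of the categories but discard how far apart their means are, which may suit models that should not rely on the exact magnitudes.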


In [30]:
tips_df = sns.load_dataset('tips')
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [31]:
tips_df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [32]:
# 'time' is a categorical column; passing observed explicitly avoids the pandas FutureWarning
total_bill_mean_based_on_time = tips_df.groupby('time', observed=True)['total_bill'].mean().to_dict()
total_bill_mean_based_on_time


{'Lunch': 17.168676470588235, 'Dinner': 20.79715909090909}

In [39]:
# replace each 'time' category with the mean total_bill of that category
tips_df['encoded_total_bill'] = tips_df['time'].map(total_bill_mean_based_on_time)
tips_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,encoded_total_bill
0,16.99,1.01,Female,No,Sun,Dinner,2,20.797159
1,10.34,1.66,Male,No,Sun,Dinner,3,20.797159
2,21.01,3.50,Male,No,Sun,Dinner,3,20.797159
3,23.68,3.31,Male,No,Sun,Dinner,2,20.797159
4,24.59,3.61,Female,No,Sun,Dinner,4,20.797159
...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,20.797159
240,27.18,2.00,Female,Yes,Sat,Dinner,2,20.797159
241,22.67,2.00,Male,Yes,Sat,Dinner,2,20.797159
242,17.82,1.75,Male,No,Sat,Dinner,2,20.797159
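
Because this encoding is computed from the target itself, computing it on the full dataset lets information from evaluation rows leak into the feature. One way to limit that, sketched below with made-up numbers, is to compute the means on the training rows only and fall back to the overall training mean for categories that were not seen. The frames, the column name `time_encoded`, and the "Brunch" value are all hypothetical.

```python
import pandas as pd

# Hypothetical training rows (target: total_bill).
train_rows = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 14.0, 20.0, 24.0],
})

# Hypothetical rows to encode; "Brunch" never appears in training.
test_rows = pd.DataFrame({"time": ["Dinner", "Lunch", "Brunch"]})

# Learn the mapping from the training data only.
train_means = train_rows.groupby("time")["total_bill"].mean()
overall_mean = train_rows["total_bill"].mean()

# Apply it to both frames; unseen categories map to NaN and get the overall mean.
train_rows["time_encoded"] = train_rows["time"].map(train_means)
test_rows["time_encoded"] = test_rows["time"].map(train_means).fillna(overall_mean)
print(test_rows)
```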
