### Data Encoding

1. Nominal / OHE Encoding
2. Label and Ordinal Encoding
3. Target Guided Ordinal Encoding

Suppose we have a dataset with two features, Experience and Degree, from which we need to predict salary.  
Degree takes values such as BTech, MTech, etc., so it is a categorical variable.  
Most machine learning models cannot work with such text labels directly; they expect numerical inputs.  
We must therefore convert the categorical values into meaningful numerical values.  
This process is called Data Encoding.
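
To see why encoding is needed, the sketch below tries to fit a scikit-learn `LinearRegression` on a raw text column. The DataFrame and its values are made up for illustration, and the exact error message may differ between library versions.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: the 'degree' column still holds text labels
data = pd.DataFrame({
    'experience': [1, 3, 5, 7],
    'degree': ['BTech', 'MTech', 'BTech', 'PhD'],
    'salary': [30, 50, 60, 90],
})

try:
    LinearRegression().fit(data[['experience', 'degree']], data['salary'])
    fit_failed = False
except (ValueError, TypeError) as err:
    # The text labels cannot be converted to floats
    fit_failed = True
    print("Fit failed:", err)
```

Once the `degree` column is encoded as numbers, the same call can succeed.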

#### Nominal / OHE Encoding

One-hot encoding (OHE), also known as nominal encoding, represents categorical data as numerical data, which is more suitable for machine learning algorithms.  
Each category is represented as a binary vector in which exactly one bit, the one corresponding to that category, is set to 1.  

For example, if we have a categorical variable "color" with three possible values (red, blue, green), we can represent it using one-hot encoding as follows:  
1. Red [1,0,0]
2. Blue [0,1,0]
3. Green [0,0,1]    

Red, Blue and Green become three separate features. For each row, the feature matching its color is 1 and the others are 0.
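
The mapping above can be sketched by hand in plain Python. The helper `one_hot` is a hypothetical name used only for illustration; it is not a library function.

```python
def one_hot(value, categories):
    """Return a binary list with a 1 at the position of `value` in `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories]

# Same category order as in the list above
categories = ['red', 'blue', 'green']
for color in categories:
    print(color, one_hot(color, categories))
# red [1, 0, 0]
# blue [0, 1, 0]
# green [0, 0, 1]
```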

Disadvantages:
1. If a variable has many categories, many features are created. For example, 100 distinct categories produce 100 new features.
2. Sparse matrix: the encoded data is a sparse matrix, i.e. a matrix that contains mostly zeros (which is why libraries often store only the non-zero elements and their positions). With so many mostly-zero features, the model may overfit: it fits the training data very well but generalizes poorly to new data.
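
Both disadvantages can be seen in a small experiment. The sketch below assumes a made-up `city` column with 100 distinct values; it counts the features created and measures what fraction of the encoded matrix is non-zero.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column: 100 distinct categories, each appearing 5 times
cities = pd.DataFrame({'city': [f'city_{i}' for i in range(100)] * 5})

encoded = OneHotEncoder().fit_transform(cities[['city']])  # SciPy sparse matrix

n_rows, n_cols = encoded.shape
density = encoded.nnz / (n_rows * n_cols)  # fraction of non-zero entries

print(n_rows, n_cols)  # 500 100
print(density)         # 0.01
```

Each row has exactly one non-zero entry, so with 100 categories only 1% of the matrix carries information.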

In [4]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [5]:
## Creating a Simple DataFrame for Example
df = pd.DataFrame({
    'color': ['red','blue','green','green','red','blue']
})

In [6]:
df.head()

|   | color |
|---|---|
| 0 | red |
| 1 | blue |
| 2 | green |
| 3 | green |
| 4 | red |


In [7]:
## First we create an instance of OneHotEncoder
encoder = OneHotEncoder()

In [8]:
## Perform fit and then transform
encoder.fit_transform(df[['color']]) #This creates a sparse matrix
#- Single brackets (df['color']): Returns a Series — essentially a 1D labeled array.
#- Double brackets (df[['color']]): Returns a DataFrame — the 2D structure, even if it’s only one column.

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 6 stored elements and shape (6, 3)>

In [9]:
encoder.fit_transform(df[['color']]).toarray()
# Convert to a dense array so we can see the values
# The categories are sorted alphabetically, so blue is the first feature, followed by green and then red
# So blue is represented by [1, 0, 0], green by [0, 1, 0] and red by [0, 0, 1]

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [10]:
encoded = encoder.fit_transform(df[['color']]).toarray()

In [11]:
encoder_df = pd.DataFrame(encoded,columns=encoder.get_feature_names_out())
# encoder.get_feature_names_out() gets all the feature names

In [12]:
encoder_df

|   | color_blue | color_green | color_red |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 |
| 5 | 1.0 | 0.0 | 0.0 |


In [13]:
## For new data
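encoder.transform(pd.DataFrame({'color': ['blue']})).toarray()
# Passing a DataFrame with the same column name used in fit avoids scikit-learn's feature-name warning
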
array([[1., 0., 0.]])
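
New data may also contain a category the encoder never saw during `fit`. The sketch below compares the default behaviour with the `handle_unknown='ignore'` option of `OneHotEncoder`; the `purple` value and both DataFrames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({'color': ['red', 'blue', 'green']})
new = pd.DataFrame({'color': ['blue', 'purple']})  # 'purple' was never seen in fit

# Default encoder: an unseen category raises an error
strict = OneHotEncoder().fit(train)
try:
    strict.transform(new)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': an unseen category becomes an all-zero row
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
result = lenient.transform(new).toarray()

print(raised)  # True
print(result)
# [[1. 0. 0.]
#  [0. 0. 0.]]
```

Whether to fail loudly or encode unknowns as zeros depends on the application; neither is the right choice in every case.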

In [14]:
pd.concat([df,encoder_df],axis=1)

|   | color | color_blue | color_green | color_red |
|---|---|---|---|---|
| 0 | red | 0.0 | 0.0 | 1.0 |
| 1 | blue | 1.0 | 0.0 | 0.0 |
| 2 | green | 0.0 | 1.0 | 0.0 |
| 3 | green | 0.0 | 1.0 | 0.0 |
| 4 | red | 0.0 | 0.0 | 1.0 |
| 5 | blue | 1.0 | 0.0 | 0.0 |


In [15]:
## Assignment
import seaborn as sns
tips = sns.load_dataset('tips')
tips
# sex, smoker, day, time are categorical features

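|   | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |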


In [16]:
type(tips)

pandas.core.frame.DataFrame

In [25]:
encoder = OneHotEncoder()

sex_encoded = encoder.fit_transform(tips[['sex']]).toarray()
sex_encoder_df = pd.DataFrame(sex_encoded, columns=encoder.get_feature_names_out())
print(encoder.transform([['Male']]).toarray())

day_encoded = encoder.fit_transform(tips[['day']]).toarray()
day_encoder_df = pd.DataFrame(day_encoded, columns=encoder.get_feature_names_out())
print(encoder.transform([['Sat']]).toarray())

time_encoded = encoder.fit_transform(tips[['time']]).toarray()
time_encoder_df = pd.DataFrame(time_encoded, columns=encoder.get_feature_names_out())
print(encoder.transform([['Dinner']]).toarray())

[[0. 1.]]
[[0. 1. 0. 0.]]
[[1. 0.]]




In [21]:
sex_encoder_df.head()

|   | sex_Female | sex_Male |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 |
| 2 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 |
| 4 | 1.0 | 0.0 |


In [22]:
day_encoder_df.head()

|   | day_Fri | day_Sat | day_Sun | day_Thur |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.0 | 1.0 | 0.0 |


In [23]:
time_encoder_df.head()

|   | time_Dinner | time_Lunch |
|---|---|---|
| 0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 |
| 2 | 1.0 | 0.0 |
| 3 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 |
