# Data Encoding
> The idea of data encoding is as simple as the name suggests. Categorical features (variables) hold string values, and we need to transform them into numerical values before a model can use them. This process is called data encoding (in simple terms).

**Types**
1. Nominal / One Hot Encoding (OHE)
2. Label / Ordinal Encoding
3. Target Guided Ordinal Encoding
    

## Nominal / OHE

It is one of the techniques used to represent categorical data as numerical data, which is more suitable for ML algorithms.
In this method, every distinct value of a categorical feature becomes its own feature, and each row is represented as a binary vector in which each bit corresponds to one category. The result is a sparse matrix of 0s and 1s.

- For example, if a feature `col` has the values [Blue, Green, Yellow], OHE creates 3 binary features: Blue, Green and Yellow. Each row gets a 1 in the column of its own category and a 0 everywhere else:

    Blue = [1,0,0] (1 where Blue is true), Green = [0,1,0], Yellow = [0,0,1]

- This also leads to a limitation of the technique: when a categorical feature has many distinct values, OHE creates just as many new features, which can cause the model to overfit the training set.
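
The idea above can be sketched in plain Python before turning to scikit-learn. This is only an illustrative sketch: the helper `one_hot` and the `user_…` values are hypothetical names made up for this example, and sorting the categories is an assumption used to fix the column order.

```python
def one_hot(values):
    """Return (categories, rows) with one binary column per distinct value."""
    # Assumption: sort the distinct values so the column order is deterministic.
    categories = sorted(set(values))
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for value in values:
        row = [0] * len(categories)   # start with all zeros
        row[position[value]] = 1      # switch on the bit of this row's category
        rows.append(row)
    return categories, rows


colors = ["Blue", "Green", "Yellow"]
categories, rows = one_hot(colors)
for color, row in zip(colors, rows):
    print(color, row)

# The limitation: the number of new columns equals the number of distinct values.
ids = [f"user_{i}" for i in range(1000)]   # hypothetical high-cardinality feature
wide_categories, _ = one_hot(ids)
print(len(wide_categories))
```

A feature with 3 distinct values yields 3 columns, while the made-up 1000-value feature yields 1000 columns, which is the growth the limitation refers to.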

In [1]:
# ONE HOT ENCODER
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
df = pd.DataFrame({
    "col" : ["Red", "Blue", 'Green', 'Yellow', 'Black', 'Green', "Red", "Yellow"]
})

In [5]:
df

Unnamed: 0,col
0,Red
1,Blue
2,Green
3,Yellow
4,Black
5,Green
6,Red
7,Yellow


In [6]:
encoder = OneHotEncoder()   #create an encoder object

In [14]:
encoded = encoder.fit_transform(df[['col']]).toarray()

In [16]:
# Convert the result into a DataFrame for a more readable view
encoded_df = pd.DataFrame(
    encoded,
    columns=encoder.get_feature_names_out(),  # names of the generated features
)

In [18]:
encoded_df
# These columns are now the features, replacing the original `col` feature

Unnamed: 0,col_Black,col_Blue,col_Green,col_Red,col_Yellow
0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0
7,0.0,0.0,0.0,0.0,1.0


In [19]:
# Now encode a new data point whose value already exists in `col`

encoder.transform([['Green']]).toarray()



array([[0., 0., 1., 0., 0.]])

In [22]:
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,col,col_Black,col_Blue,col_Green,col_Red,col_Yellow
0,Red,0.0,0.0,0.0,1.0,0.0
1,Blue,0.0,1.0,0.0,0.0,0.0
2,Green,0.0,0.0,1.0,0.0,0.0
3,Yellow,0.0,0.0,0.0,0.0,1.0
4,Black,1.0,0.0,0.0,0.0,0.0
5,Green,0.0,0.0,1.0,0.0,0.0
6,Red,0.0,0.0,0.0,1.0,0.0
7,Yellow,0.0,0.0,0.0,0.0,1.0


In [21]:
# Curiosity: what happens with a value the encoder has never seen?
encoder.transform([['Brown']]).toarray()    # raises a ValueError
# By default, transform only accepts categories that were seen during fit



ValueError: Found unknown categories ['Brown'] in column 0 during transform
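
One possible way around this error is the `handle_unknown="ignore"` option of `OneHotEncoder`, which encodes an unseen category as a row of zeros instead of raising. The sketch below rebuilds a small frame with the same colors as above. Whether an all-zero row is acceptable depends on the model, so treat this as an option rather than a recommendation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same categories as the example above, rebuilt so this cell runs on its own.
df = pd.DataFrame({"col": ["Red", "Blue", "Green", "Yellow", "Black"]})

# handle_unknown="ignore": unseen categories become an all-zero row at transform time.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(df[["col"]])

new_point = pd.DataFrame({"col": ["Brown"]})   # 'Brown' was never seen during fit
row = tolerant_encoder.transform(new_point).toarray()
print(row)
```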

In [24]:
# Let's use a real dataset
import seaborn as sns
data = sns.load_dataset('tips')

In [25]:
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [26]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [31]:
# 'sex', 'smoker', 'day' and 'time' are all categorical
# Let's encode all four columns at once

encoded_tips = encoder.fit_transform(data[['sex', 'smoker', 'day', 'time']]).toarray()
encoded_tips_df = pd.DataFrame(encoded_tips, columns=encoder.get_feature_names_out())

In [32]:
encoded_tips_df

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
239,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [36]:
# Let's make a DataFrame of the non-categorical features
num_df = data[['total_bill','tip','size']]
num_df

Unnamed: 0,total_bill,tip,size
0,16.99,1.01,2
1,10.34,1.66,3
2,21.01,3.50,3
3,23.68,3.31,2
4,24.59,3.61,4
...,...,...,...
239,29.03,5.92,3
240,27.18,2.00,2
241,22.67,2.00,2
242,17.82,1.75,2


In [38]:
updated_data = pd.concat([num_df,encoded_tips_df], axis= 1)
updated_data

Unnamed: 0,total_bill,tip,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.50,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,27.18,2.00,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,22.67,2.00,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,17.82,1.75,2,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [None]:
# So this is our final dataset after applying OHE

## Label / Ordinal Encoding