# Data Encoding
> The idea of data encoding is as simple as the name suggests. Categorical features (variables) hold string values, and we need to transform them into numerical values before a model can use them. This process is called data encoding.

**Types**
1. Nominal / One Hot Encoding (OHE)
2. Label / Ordinal Encoding
3. Target Guided Ordinal Encoding
    

## Nominal / OHE

It is one of the techniques used to represent categorical data as numerical data, which is more suitable for ML algorithms.
In this method, every unique value of a categorical feature becomes a feature of its own, and each row is represented as a binary vector in which each bit corresponds to one category. The result is a sparse matrix of 0s and 1s.

- For example, if a feature `col` takes the values [Blue, Green, Yellow], OHE creates 3 binary features: Blue, Green and Yellow. Each row gets a 1 in the column of its own category and 0 elsewhere:

    Blue = [1,0,0], Green = [0,1,0], Yellow = [0,0,1]

- This also leads us to a limitation of this technique: when a categorical feature has many unique values, OHE creates just as many new features, which increases dimensionality and can lead the model to overfit the training set.
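
To make the binary-vector idea concrete before using scikit-learn, here is a minimal sketch in plain Python. The helper `one_hot` is a hypothetical name introduced only for illustration; it is not part of any library.

```python
# Minimal one-hot sketch in plain Python (illustrative helper, not a library API).
def one_hot(values, categories):
    """Return one binary vector per value, with a 1 at the category's position."""
    index = {cat: i for i, cat in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        vectors.append(row)
    return vectors

categories = ["Blue", "Green", "Yellow"]
encoded_rows = one_hot(["Blue", "Green", "Yellow", "Green"], categories)
for row in encoded_rows:
    print(row)
```

Each row contains exactly one 1, and the number of columns equals the number of categories, which is why a feature with many unique values produces many columns.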

In [3]:
# ONE HOT ENCODER
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

In [4]:
df = pd.DataFrame({
    "col" : ["Red", "Blue", 'Green', 'Yellow', 'Black', 'Green', "Red", "Yellow"]
})

In [5]:
df

Unnamed: 0,col
0,Red
1,Blue
2,Green
3,Yellow
4,Black
5,Green
6,Red
7,Yellow


In [6]:
encoder = OneHotEncoder()   #create an encoder object

In [7]:
encoded = encoder.fit_transform(df[['col']]).toarray()

In [8]:
# let's convert this into a df for a more understandable view
# get_feature_names_out() gives the names of the new features
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out())

In [9]:
encoded_df
# Now these are my features not the earlier col feature

Unnamed: 0,col_Black,col_Blue,col_Green,col_Red,col_Yellow
0,0.0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0
5,0.0,0.0,1.0,0.0,0.0
6,0.0,0.0,0.0,1.0,0.0
7,0.0,0.0,0.0,0.0,1.0


In [10]:
# now encode a new data point whose value already exists among the fitted categories

encoder.transform([['Green']]).toarray()



array([[0., 0., 1., 0., 0.]])

In [11]:
pd.concat([df, encoded_df], axis=1)

Unnamed: 0,col,col_Black,col_Blue,col_Green,col_Red,col_Yellow
0,Red,0.0,0.0,0.0,1.0,0.0
1,Blue,0.0,1.0,0.0,0.0,0.0
2,Green,0.0,0.0,1.0,0.0,0.0
3,Yellow,0.0,0.0,0.0,0.0,1.0
4,Black,1.0,0.0,0.0,0.0,0.0
5,Green,0.0,0.0,1.0,0.0,0.0
6,Red,0.0,0.0,0.0,1.0,0.0
7,Yellow,0.0,0.0,0.0,0.0,1.0


In [12]:
# curiosity
# encoder.transform([['Brown']]).toarray()
#
# If we pass a value that was not seen during fit, this raises a ValueError:
# by default, the transform method only accepts categories that already exist.
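
If unseen categories are expected at prediction time, one possible way around the ValueError is the encoder's `handle_unknown="ignore"` option. The sketch below assumes a small stand-in frame; the names `train`, `tolerant_encoder` and `new_points` are introduced here for illustration only.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"col": ["Red", "Blue", "Green", "Yellow", "Black"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error.
tolerant_encoder = OneHotEncoder(handle_unknown="ignore")
tolerant_encoder.fit(train[["col"]])

new_points = pd.DataFrame({"col": ["Green", "Brown"]})  # "Brown" was never seen
encoded_new = tolerant_encoder.transform(new_points).toarray()
print(encoded_new)
```

The row for "Brown" comes out as all zeros, so the model sees it as "none of the known categories" rather than failing.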

In [13]:
# Let's use an actual data
import seaborn as sns
data = sns.load_dataset('tips')

In [14]:
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype   
---  ------      --------------  -----   
 0   total_bill  244 non-null    float64 
 1   tip         244 non-null    float64 
 2   sex         244 non-null    category
 3   smoker      244 non-null    category
 4   day         244 non-null    category
 5   time        244 non-null    category
 6   size        244 non-null    int64   
dtypes: category(4), float64(2), int64(1)
memory usage: 7.4 KB


In [16]:
# so 'sex', 'smoker', 'day', 'time' are all categorical
# let's try to transform the dataset

encoded_tips = encoder.fit_transform(data[['sex', 'smoker', 'day', 'time']]).toarray()
encoded_tips_df = pd.DataFrame(encoded_tips, columns=encoder.get_feature_names_out())

In [17]:
encoded_tips_df

Unnamed: 0,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...
239,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [18]:
# lets make a df of non categorical features
num_df = data[['total_bill','tip','size']]
num_df

Unnamed: 0,total_bill,tip,size
0,16.99,1.01,2
1,10.34,1.66,3
2,21.01,3.50,3
3,23.68,3.31,2
4,24.59,3.61,4
...,...,...,...
239,29.03,5.92,3
240,27.18,2.00,2
241,22.67,2.00,2
242,17.82,1.75,2


In [19]:
updated_data = pd.concat([num_df,encoded_tips_df], axis= 1)
updated_data

Unnamed: 0,total_bill,tip,size,sex_Female,sex_Male,smoker_No,smoker_Yes,day_Fri,day_Sat,day_Sun,day_Thur,time_Dinner,time_Lunch
0,16.99,1.01,2,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,10.34,1.66,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
2,21.01,3.50,3,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,23.68,3.31,2,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,24.59,3.61,4,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,3,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
240,27.18,2.00,2,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
241,22.67,2.00,2,0.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
242,17.82,1.75,2,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


In [20]:
# so this is our final data after applying OHE

## Label Encoding
Label encoding assigns a unique numerical value to each value of a categorical variable. With scikit-learn's `LabelEncoder`, the labels are assigned in sorted (alphabetical) order.
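
As a rough sketch of what "alphabetical order" means here, the same numbering can be reproduced in plain Python. The names `classes`, `label_of` and `codes` are made up for this illustration.

```python
values = ["Red", "Blue", "Green", "Yellow", "Black", "Green", "Red", "Yellow"]

# Sort the unique categories, then number them in that order.
classes = sorted(set(values))
label_of = {category: code for code, category in enumerate(classes)}
codes = [label_of[v] for v in values]

print(label_of)
print(codes)
```

Black gets 0 and Yellow gets 4 purely because of spelling, not because of any property of the colours.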

In [21]:
from sklearn.preprocessing import LabelEncoder
lbl_encoder = LabelEncoder()

In [22]:
df = pd.DataFrame({
    "col" : ["Red", "Blue", 'Green', 'Yellow', 'Black', 'Green', "Red", "Yellow"]
})

In [24]:
lbl_encoded = lbl_encoder.fit_transform(df["col"])  # LabelEncoder expects a 1-D column, not a 2-D frame


In [25]:
lbl_df = pd.DataFrame(lbl_encoded)
pd.concat([df,lbl_df],axis=1)

# This is what has been done in label encoding

Unnamed: 0,col,0
0,Red,3
1,Blue,1
2,Green,2
3,Yellow,4
4,Black,0
5,Green,2
6,Red,3
7,Yellow,4


In [26]:
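# Label encoder assigns values in alphabetical order, which may lead to problems: the model may treat Red (3) as greater than Blue (1), although the colours have no natural order. So label encoding is a poor fit for nominal features; when the categories do have an order, we should specify it ourselves (ordinal encoding).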

In [27]:
lbl_encoder.transform(["Red"])  # pass a 1-D list of labels


array([3])

### Ordinal Encoding
Ordinal encoding is used when the categories have a natural rank or order. Each value of the category is assigned a numerical value based on its position in that order. It is useful for categories like education level or clothing sizes.
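
The idea can be sketched in plain Python with an explicit order list. The names `size_order` and `rank_of` are illustrative, and the order itself is an assumption we supply from domain knowledge.

```python
# The order is domain knowledge we supply; it cannot be inferred from the strings.
size_order = ["small", "medium", "large"]
rank_of = {size: rank for rank, size in enumerate(size_order)}

sizes = ["small", "medium", "large", "large", "medium", "small", "large"]
ranks = [rank_of[s] for s in sizes]
print(ranks)
```

Unlike alphabetical label encoding, the numbers here carry meaning: a larger number really is a larger size.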


In [28]:
from sklearn.preprocessing import OrdinalEncoder



In [29]:
# create an ordinal dataframe
df = pd.DataFrame({
    'size' : ['small', 'medium', 'large', 'large', 'medium','small', 'large']
})
df

Unnamed: 0,size
0,small
1,medium
2,large
3,large
4,medium
5,small
6,large


In [30]:
ord_encoder = OrdinalEncoder(categories=[['small', 'medium','large']])  #by mentioning the categories in proper order we are making sure how we are giving the ranks
ord_encoder.fit_transform(df[['size']])

array([[0.],
       [1.],
       [2.],
       [2.],
       [1.],
       [0.],
       [2.]])

## Target guided Ordinal Encoding

This technique is used to encode categorical variables **based on their relationship with the target (dependent) variable**. It becomes useful when a categorical variable has a large number of unique categories.

With this method we replace each category of the variable with a **numerical value based on the mean or median of the target variable** for that category. This creates a monotonic relationship between the encoded variable and the target variable, which can improve the effectiveness of our model.
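
Written out, and assuming the mean is used, the encoded value of a category $c$ is the average target over the rows that belong to it. The symbols $x_i$ (category of row $i$), $y_i$ (target of row $i$) and $n_c$ (number of rows with category $c$) are notation introduced here for illustration:

$$
\operatorname{enc}(c) = \frac{1}{n_c} \sum_{i \,:\, x_i = c} y_i
$$

For instance, a category that appears twice with target values 200 and 250 is encoded as $(200 + 250) / 2 = 225$.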

In [31]:
# let's create a dataset
df = pd.DataFrame({
    'city': ['NY', "London", "Paris", "NY", "Paris", "Kolkata", "Delhi"],
    'price': [200, 150, 500, 250 , 300, 190, 200]
})
df

Unnamed: 0,city,price
0,NY,200
1,London,150
2,Paris,500
3,NY,250
4,Paris,300
5,Kolkata,190
6,Delhi,200


In [34]:
# we will treat the price as the target variable.
# there are two values corresponding to NY, 200 and 250, so we replace NY by the mean (or median) of the target values of that category


df.groupby('city')['price'].mean()

city
Delhi      200.0
Kolkata    190.0
London     150.0
NY         225.0
Paris      400.0
Name: price, dtype: float64

In [35]:
mean_encoded = df.groupby('city')['price'].mean().to_dict()
mean_encoded

{'Delhi': 200.0,
 'Kolkata': 190.0,
 'London': 150.0,
 'NY': 225.0,
 'Paris': 400.0}

In [36]:
# now what we will do is we will replace the city names by the mean prices
df['city_encoded'] = df['city'].map(mean_encoded)   #this will map the city names with means
df

Unnamed: 0,city,price,city_encoded
0,NY,200,225.0
1,London,150,150.0
2,Paris,500,400.0
3,NY,250,225.0
4,Paris,300,400.0
5,Kolkata,190,190.0
6,Delhi,200,200.0


In [37]:
# Now we don't need the city names anymore
df[['price', 'city_encoded']]

Unnamed: 0,price,city_encoded
0,200,225.0
1,150,150.0
2,500,400.0
3,250,225.0
4,300,400.0
5,190,190.0
6,200,200.0
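
The cells above use the category means directly as the encoded values. A variant that matches the "ordinal" part of the name more literally is to rank the categories by their mean target and use the rank. The sketch below does this in plain Python on the same city data; the names `mean_price`, `rank_of_city` and `city_rank` are illustrative.

```python
cities = ["NY", "London", "Paris", "NY", "Paris", "Kolkata", "Delhi"]
prices = [200, 150, 500, 250, 300, 190, 200]

# Collect the target values per category, then average them.
prices_by_city = {}
for city, price in zip(cities, prices):
    prices_by_city.setdefault(city, []).append(price)
mean_price = {city: sum(p) / len(p) for city, p in prices_by_city.items()}

# Rank the categories by their mean target (lowest mean gets rank 0).
ordered = sorted(mean_price, key=mean_price.get)
rank_of_city = {city: rank for rank, city in enumerate(ordered)}
city_rank = [rank_of_city[c] for c in cities]

print(rank_of_city)
print(city_rank)
```

Whether to keep the raw means or the ranks is a judgment call: the means preserve the size of the gaps between categories, while the ranks only preserve their order.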


In [38]:
# let's try with a practical dataset 'tips'
import seaborn as sns
data = sns.load_dataset('tips')
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [39]:
# ENCODE 'time' BASED ON THE MEAN total_bill OF EACH CATEGORY

data2 = data[['total_bill', 'time']]
data2

Unnamed: 0,total_bill,time
0,16.99,Dinner
1,10.34,Dinner
2,21.01,Dinner
3,23.68,Dinner
4,24.59,Dinner
...,...,...
239,29.03,Dinner
240,27.18,Dinner
241,22.67,Dinner
242,17.82,Dinner


In [42]:
# 'time' is a categorical column; passing observed explicitly avoids pandas' FutureWarning about its default
mean_bill_based_on_time = data2.groupby('time', observed=False).mean()
mean_bill_based_on_time


Unnamed: 0_level_0,total_bill
time,Unnamed: 1_level_1
Lunch,17.168676
Dinner,20.797159


In [43]:
# so now 17.2 will represent Lunch
# and 20.8 will represent Dinner
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
data2 = data2.assign(encoded_time=data2['time'].map(mean_bill_based_on_time['total_bill']))
data2


Unnamed: 0,total_bill,time,encoded_time
0,16.99,Dinner,20.797159
1,10.34,Dinner,20.797159
2,21.01,Dinner,20.797159
3,23.68,Dinner,20.797159
4,24.59,Dinner,20.797159
...,...,...,...
239,29.03,Dinner,20.797159
240,27.18,Dinner,20.797159
241,22.67,Dinner,20.797159
242,17.82,Dinner,20.797159


In [44]:
data2[['total_bill', 'encoded_time']]

Unnamed: 0,total_bill,encoded_time
0,16.99,20.797159
1,10.34,20.797159
2,21.01,20.797159
3,23.68,20.797159
4,24.59,20.797159
...,...,...
239,29.03,20.797159
240,27.18,20.797159
241,22.67,20.797159
242,17.82,20.797159
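
One caveat when the encoded column feeds a model: the category means are computed from the target, so they are normally learned from the training rows only and then applied to new rows. The sketch below assumes a hypothetical train/test split and uses the overall training mean as a fallback for unseen categories, which is one common choice among several.

```python
import pandas as pd

# Hypothetical split: the encoding is learned from the training rows only.
train = pd.DataFrame({
    "city": ["NY", "London", "Paris", "NY", "Paris"],
    "price": [200, 150, 500, 250, 300],
})
test = pd.DataFrame({"city": ["Paris", "NY", "Delhi"]})  # "Delhi" is unseen

city_means = train.groupby("city")["price"].mean()
global_mean = train["price"].mean()  # fallback for categories not seen in training

# map() leaves unseen categories as NaN; fillna() replaces them with the fallback.
test["city_encoded"] = test["city"].map(city_means).fillna(global_mean)
print(test)
```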


In [45]:
# Everything so far is work we need to do before training our model; it is part of data preprocessing. Next up is EDA and its practical implementation, then the ML algorithms.