## __Data Encoding__
Data encoding is the process of converting categorical (non-numerical) data into numerical data. This is necessary because most machine learning algorithms require numerical input.

Several methods to encode data:
1. One-Hot Encoding (OHE) or nominal encoding
2. Label and Ordinal Encoding
3. Target guided ordinal encoding

### Nominal or one-hot encoding

What is nominal data?  
Nominal data is categorical data in which order does not matter and no label is preferred over another, for example colors ['Red', 'Blue', 'Green'] or cities ['Mumbai', 'Delhi', 'Bangalore']. Here we cannot say that 'Red' is greater than 'Blue' or that 'Mumbai' is greater than 'Delhi'.

`OHE` converts categorical data (any text or label) into a binary format (0, 1): each unique label gets its own column, which holds 1 for rows with that label and 0 otherwise. The result can be used by machine learning algorithms that require numerical input.
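
To make the mechanism concrete, here is a minimal sketch that builds the one-hot matrix by hand with NumPy. It is not how `OneHotEncoder` is implemented; it only assumes, as an illustration, that columns are ordered by the sorted category names.

```python
import numpy as np

colors = np.array(['Red', 'Green', 'Yellow', 'Blue'])

# Sorted distinct categories, plus the position of each value in that sorted list
categories, codes = np.unique(colors, return_inverse=True)

# One row per sample, one column per category; a single 1 marks the sample's category
one_hot = np.zeros((len(colors), len(categories)), dtype=int)
one_hot[np.arange(len(colors)), codes] = 1

print(categories)
print(one_hot)
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.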

In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

In [2]:
# Creating data to be encoded

data = ['Red','Green','Yellow','Blue']
df = pd.DataFrame(data,columns=['Color'])

df

Unnamed: 0,Color
0,Red
1,Green
2,Yellow
3,Blue


In [3]:
encode = OneHotEncoder()  # Create an instance of the OneHotEncoder class

encoded_data = encode.fit_transform(df)
encoded_data
# The result is a sparse matrix: a matrix in which most of the elements are 0

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (4, 4)>
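
The "4 stored elements" in the output refers to the non-zero entries only. The sketch below, which assumes SciPy is available (it is installed alongside scikit-learn), rebuilds the same 4x4 one-hot matrix by hand and compares the number of cells with the number of stored values.

```python
import numpy as np
from scipy import sparse

# The one-hot matrix for ['Red', 'Green', 'Yellow', 'Blue'], written out densely
dense = np.array([
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# Compressed Sparse Row format keeps only the non-zero entries and their positions
csr = sparse.csr_matrix(dense)

print(dense.size)  # total number of cells in the dense array
print(csr.nnz)     # number of explicitly stored (non-zero) values
```

Calling `.toarray()` on the sparse matrix restores the dense form when a plain array is needed.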

In [4]:
# Convert the sparse matrix into an array, then into a DataFrame
encoded_arr = encoded_data.toarray()
print('Encoded array:\n',encoded_arr)

encoded_df = pd.DataFrame(encoded_arr,columns=encode.get_feature_names_out())   # Use the encoder's feature names as column names
encoded_df

Encoded array:
 [[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]


Unnamed: 0,Color_Blue,Color_Green,Color_Red,Color_Yellow
0,0.0,0.0,1.0,0.0
1,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0


In [5]:
# Finally, combine the original and the encoded data
final_df = pd.concat([df,encoded_df],axis=1)
final_df

Unnamed: 0,Color,Color_Blue,Color_Green,Color_Red,Color_Yellow
0,Red,0.0,0.0,1.0,0.0
1,Green,0.0,1.0,0.0,0.0
2,Yellow,0.0,0.0,0.0,1.0
3,Blue,1.0,0.0,0.0,0.0


#### Key points to remember:
1. OHE is used for `nominal` data, i.e. when order does not matter and there is no preference for any label.
2. OHE increases the `dimensionality` of the data, i.e. it creates a new feature (column) for each unique label (see above). This can make the data large and computationally expensive to process.
3. `Sparse data` is created in OHE, which means most of the values in the new columns will be 0s. This can make the data inefficient to store and process.
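
Points 2 and 3 can be checked on a small example. The sketch below uses `pd.get_dummies` as a quick stand-in for `OneHotEncoder`, on a hypothetical `City` column invented for illustration.

```python
import pandas as pd

# Hypothetical column: 6 rows but only 3 distinct labels
cities = pd.DataFrame({'City': ['Mumbai', 'Delhi', 'Bangalore', 'Delhi', 'Mumbai', 'Delhi']})

dummies = pd.get_dummies(cities, columns=['City'])

n_labels = cities['City'].nunique()
zero_fraction = 1 - dummies.to_numpy().mean()

print(n_labels)        # number of distinct labels
print(dummies.shape)   # (rows, one column per distinct label)
print(zero_fraction)   # share of cells that are 0
```

With k distinct labels each row has one 1 and k - 1 zeros, so the share of zeros grows as the number of labels grows.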

#### ___Example___
Encode all categorical columns in the dataset using OHE.

In [6]:
data = sns.load_dataset('tips')
data

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.50,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.00,Female,Yes,Sat,Dinner,2
241,22.67,2.00,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2


In [7]:
encode = OneHotEncoder()
encoded_data = encode.fit_transform(data).toarray()

encoded_df = pd.DataFrame(encoded_data,columns=encode.get_feature_names_out())
encoded_df.head()

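# 368 columns (features) in total: the encoder was fit on every column, including the numeric ones (total_bill, tip, size), so each distinct value in any column gets its own feature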

Unnamed: 0,total_bill_3.07,total_bill_5.75,total_bill_7.25,total_bill_7.51,total_bill_7.56,total_bill_7.74,total_bill_8.35,total_bill_8.51,total_bill_8.52,total_bill_8.58,...,day_Sun,day_Thur,time_Dinner,time_Lunch,size_1,size_2,size_3,size_4,size_5,size_6
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [8]:
final_df = pd.concat([data,encoded_df],axis = 1)
final_df

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,total_bill_3.07,total_bill_5.75,total_bill_7.25,...,day_Sun,day_Thur,time_Dinner,time_Lunch,size_1,size_2,size_3,size_4,size_5,size_6
0,16.99,1.01,Female,No,Sun,Dinner,2,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1,10.34,1.66,Male,No,Sun,Dinner,3,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,21.01,3.50,Male,No,Sun,Dinner,3,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,23.68,3.31,Male,No,Sun,Dinner,2,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,24.59,3.61,Female,No,Sun,Dinner,4,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,Male,No,Sat,Dinner,3,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
240,27.18,2.00,Female,Yes,Sat,Dinner,2,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
241,22.67,2.00,Male,Yes,Sat,Dinner,2,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
242,17.82,1.75,Male,No,Sat,Dinner,2,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


### Label encoding

The process of assigning each unique category a unique numerical label.  
The labels are usually assigned based on alphabetical order or on frequency; scikit-learn's `LabelEncoder` uses sorted (alphabetical) order, starting from 0.  
ex : ['red','yellow','green'] → [1, 2, 0]
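
As a rough illustration of alphabetical label assignment, the following plain-Python sketch sorts the distinct values and numbers them from 0. It is a hand-written approximation of the idea, not the library's implementation.

```python
colours = ['Red', 'Green', 'Yellow', 'Green', 'Red', 'Blue']

# Sort the distinct labels and number them 0, 1, 2, ...
mapping = {label: code for code, label in enumerate(sorted(set(colours)))}

# Replace every value with its code
encoded = [mapping[c] for c in colours]

print(mapping)
print(encoded)
```

Repeated values always receive the same code, since the mapping depends only on the set of distinct labels.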

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
data = ['Red','Green','Yellow','Green','Red','Blue']
arr = np.array(data)
arr

array(['Red', 'Green', 'Yellow', 'Green', 'Red', 'Blue'], dtype='<U6')

In [11]:
encode = LabelEncoder()  # Create an instance
encoded_Arr = encode.fit_transform(arr)
# LabelEncoder expects a 1-D array; passing a DataFrame instead produces a warning

print('The encoded(transformed) array is:\n',encoded_Arr)

The encoded(transformed) array is:
 [2 1 3 1 2 0]


In [12]:
original_df = pd.DataFrame(arr,columns=['Colours'])
encoded_df = pd.DataFrame(encoded_Arr,columns=['Labels'])

final_df = pd.concat([original_df,encoded_df],axis=1)
final_df

Unnamed: 0,Colours,Labels
0,Red,2
1,Green,1
2,Yellow,3
3,Green,1
4,Red,2
5,Blue,0
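
To see which label belongs to which colour, and to go back from labels to the original values, the fitted encoder's `classes_` attribute and `inverse_transform` method can be used. A minimal sketch on the same colours:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

arr = np.array(['Red', 'Green', 'Yellow', 'Green', 'Red', 'Blue'])

encoder = LabelEncoder()
labels = encoder.fit_transform(arr)

# classes_ holds the fitted categories; a category's position is its label
print(encoder.classes_)

# inverse_transform maps integer labels back to the original strings
decoded = encoder.inverse_transform(labels)
print(decoded)
```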


### Ordinal Encoding

The process of converting categorical data into numerical labels that follow the order of the categories.  
ex : ['small', 'medium', 'large'] -> [0, 1, 2]

This gives only a single column after encoding.  
Use this when we have ordinal data (data in which the order of the categories matters).  
We don't use it for nominal data because the ML algorithm may then treat the labels as magnitudes. ex : ['red','green','yellow'] → [0,1,2], the algorithm may consider 'green' (1) > 'red' (0)
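
The idea can also be sketched with pandas alone, by declaring the order of the categories explicitly. The `sizes` series below is a hypothetical example, and `pd.Categorical` is used only to illustrate that the codes follow the declared order rather than the alphabet.

```python
import pandas as pd

sizes = pd.Series(['small', 'large', 'medium', 'small'])

# Declare the order explicitly; codes then follow that order
ordered = pd.Categorical(sizes, categories=['small', 'medium', 'large'], ordered=True)

codes = ordered.codes.tolist()
print(codes)
```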

In [13]:
from sklearn.preprocessing import OrdinalEncoder

In [14]:
# Creating data
data = pd.DataFrame({'Size':['Small','Medium','Large','XL','Small','XL']})
data

Unnamed: 0,Size
0,Small
1,Medium
2,Large
3,XL
4,Small
5,XL


In [15]:
encode = OrdinalEncoder(categories=[['Small','Medium','Large','XL']]) # Specify the category order explicitly
encoded_data = encode.fit_transform(data)
encoded_data

array([[0.],
       [1.],
       [2.],
       [3.],
       [0.],
       [3.]])

In [16]:
encoded_df = pd.DataFrame(encoded_data,columns=['Label'])
encoded_df

Unnamed: 0,Label
0,0.0
1,1.0
2,2.0
3,3.0
4,0.0
5,3.0


In [17]:
final_df = pd.concat([data,encoded_df],axis=1)
final_df

Unnamed: 0,Size,Label
0,Small,0.0
1,Medium,1.0
2,Large,2.0
3,XL,3.0
4,Small,0.0
5,XL,3.0
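
Passing `categories` matters because, when it is left out, `OrdinalEncoder` falls back to sorted (alphabetical) order. The sketch below compares the two on the same `Size` data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

data = pd.DataFrame({'Size': ['Small', 'Medium', 'Large', 'XL', 'Small', 'XL']})

# Default: categories are inferred and sorted alphabetically (Large, Medium, Small, XL)
default_codes = OrdinalEncoder().fit_transform(data).ravel()

# Explicit: categories follow the order we provide
explicit_codes = OrdinalEncoder(
    categories=[['Small', 'Medium', 'Large', 'XL']]
).fit_transform(data).ravel()

print(default_codes)
print(explicit_codes)
```

With the default order 'Large' receives a smaller label than 'Small', which does not reflect the real size order.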


### Target guided ordinal encoding
Encoding of a categorical variable based on its relationship with the target variable.  
It is useful when a categorical variable has a large number of categories.  
Here, we replace each unique category with the mean/median of the target values for that category.

In [18]:
data = pd.DataFrame({
    'City':['Delhi','Panjab','HYD','Chennai anna','Delhi','Mumbai','Panjab','HYD'],
    'Price':[100,120,130,140,150,160,170,180]
})
data

Unnamed: 0,City,Price
0,Delhi,100
1,Panjab,120
2,HYD,130
3,Chennai anna,140
4,Delhi,150
5,Mumbai,160
6,Panjab,170
7,HYD,180


In [19]:
mean = data.groupby('City')['Price'].mean()
mean

City
Chennai anna    140.0
Delhi           125.0
HYD             155.0
Mumbai          160.0
Panjab          145.0
Name: Price, dtype: float64

In [20]:
data['Encoded_city'] = data['City'].map(mean)
# Series.map replaces each city with the matching mean price computed above
data

Unnamed: 0,City,Price,Encoded_city
0,Delhi,100,125.0
1,Panjab,120,145.0
2,HYD,130,155.0
3,Chennai anna,140,140.0
4,Delhi,150,125.0
5,Mumbai,160,160.0
6,Panjab,170,145.0
7,HYD,180,155.0


In [21]:
data[['Price','Encoded_city']]

Unnamed: 0,Price,Encoded_city
0,100,125.0
1,120,145.0
2,130,155.0
3,140,140.0
4,150,125.0
5,160,160.0
6,170,145.0
7,180,155.0
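
A variant that is sometimes used, shown here only as a sketch and not as part of the steps above, turns the per-city means into ordinal ranks: cities are sorted by mean price and numbered from 0. The column name `City_rank` is a name chosen for this illustration.

```python
import pandas as pd

data = pd.DataFrame({
    'City': ['Delhi', 'Panjab', 'HYD', 'Chennai anna', 'Delhi', 'Mumbai', 'Panjab', 'HYD'],
    'Price': [100, 120, 130, 140, 150, 160, 170, 180],
})

mean = data.groupby('City')['Price'].mean()

# Sort cities by mean price, then number them 0, 1, 2, ... in that order
rank = {city: i for i, city in enumerate(mean.sort_values().index)}

data['City_rank'] = data['City'].map(rank)
print(rank)
print(data)
```

The ranks keep the ordering implied by the target while discarding the size of the gaps between the means.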
