## __Data Encoding__
Data encoding is the process of converting categorical (non-numerical) data into numerical data. It is necessary because most machine learning algorithms require numerical input.

Common methods for encoding data:
1. One-Hot Encoding (OHE) or nominal encoding
2. Label and Ordinal Encoding
3. Target-guided ordinal encoding

### Nominal or one-hot encoding

What is nominal data?  
Nominal data is data in which order does not matter and no label is preferred over another, for example colors ['Red', 'Blue', 'Green'] or cities ['Mumbai', 'Delhi', 'Bangalore']. We cannot say that 'Red' is greater than 'Blue' or that 'Mumbai' is greater than 'Delhi'.

`OHE` converts categorical data (any text or label) into a binary format (0, 1): each unique label gets its own column, which holds 1 for rows carrying that label and 0 otherwise. This makes the data usable by machine learning algorithms that require numerical input.
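
To make the idea concrete before turning to scikit-learn, here is a minimal hand-rolled sketch in plain Python. The helper name `one_hot` is hypothetical and exists only for illustration; it sorts the labels alphabetically so the column order matches what the encoder in the cells below happens to produce.

```python
def one_hot(labels):
    """Return (categories, rows); each row is a list of 0/1 indicators."""
    # One output column per unique label, in sorted order
    categories = sorted(set(labels))
    position = {category: i for i, category in enumerate(categories)}

    rows = []
    for label in labels:
        row = [0] * len(categories)   # start with all zeros
        row[position[label]] = 1      # mark the column of this label
        rows.append(row)
    return categories, rows


categories, rows = one_hot(["Red", "Green", "Yellow", "Blue"])
print(categories)  # ['Blue', 'Green', 'Red', 'Yellow']
for label_row in rows:
    print(label_row)
# [0, 0, 1, 0]
# [0, 1, 0, 0]
# [0, 0, 0, 1]
# [1, 0, 0, 0]
```

Each row contains exactly one 1, which is where the name "one-hot" comes from.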

In [52]:
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder

In [53]:
# Creating data to be encoded

data = ['Red','Green','Yellow','Blue']
df = pd.DataFrame(data,columns=['Color'])

df

| | Color |
|---|---|
| 0 | Red |
| 1 | Green |
| 2 | Yellow |
| 3 | Blue |


In [54]:
encode = OneHotEncoder()  # Create an instance of the OneHotEncoder class

encoded_data = encode.fit_transform(df)
encoded_data
# The result is a sparse matrix: a matrix in which most of the elements are 0

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 4 stored elements and shape (4, 4)>

In [55]:
# Convert the sparse matrix into a dense array, then into a DataFrame
encoded_arr = encoded_data.toarray()
print('Encoded array:\n',encoded_arr)

encoded_df = pd.DataFrame(encoded_arr,columns=encode.get_feature_names_out())   # get_feature_names_out() supplies appropriate column names
encoded_df

Encoded array:
 [[0. 0. 1. 0.]
 [0. 1. 0. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]]


| | Color_Blue | Color_Green | Color_Red | Color_Yellow |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 |


In [56]:
# Final encoded data: the original column alongside its one-hot columns
final_df = pd.concat([df,encoded_df],axis=1)
final_df

| | Color | Color_Blue | Color_Green | Color_Red | Color_Yellow |
|---|---|---|---|---|---|
| 0 | Red | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | Green | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | Yellow | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | Blue | 1.0 | 0.0 | 0.0 | 0.0 |
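
For quick exploration, the same table can be built without scikit-learn. The sketch below assumes pandas' `get_dummies`, which the walkthrough above does not use; the variable names `colors`, `dummies` and `result` are illustrative only. Passing `dtype=float` is a choice made here so the values appear as 0.0/1.0, like the encoder output above.

```python
import pandas as pd

# Same toy data as in the cells above
colors = pd.DataFrame({"Color": ["Red", "Green", "Yellow", "Blue"]})

# One indicator column per unique label, prefixed with the column name
dummies = pd.get_dummies(colors["Color"], prefix="Color", dtype=float)

# Original column next to its one-hot columns
result = pd.concat([colors, dummies], axis=1)
print(result)
```

A fitted `OneHotEncoder` can be reused to transform new data with the same columns, which `get_dummies` does not do on its own, so this shortcut is better suited to one-off inspection than to a training pipeline.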


#### Key points to remember:
1. OHE is used for `nominal` data, where order does not matter and no label is preferred over another.
2. OHE increases the `dimensionality` of the data, i.e. it creates a new feature (column) for each unique label (see above). This can make the data large and computationally expensive to process.
3. OHE produces `sparse data`: most of the values in the new columns are 0s. This can make the data inefficient to store and process.
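
Points 2 and 3 can be checked numerically. The sketch below uses a made-up column of 1000 rows with 50 distinct labels (the `city_…` names are hypothetical) and reads the shape and the number of stored non-zero entries from the sparse matrix that `OneHotEncoder` returns by default.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: 1000 rows, 50 distinct labels
cities = pd.DataFrame({"City": [f"city_{i % 50}" for i in range(1000)]})

city_matrix = OneHotEncoder().fit_transform(cities)  # SciPy sparse matrix by default

n_rows, n_cols = city_matrix.shape
# nnz is the number of stored (non-zero) entries: one per row
zero_fraction = 1 - city_matrix.nnz / (n_rows * n_cols)

print(n_cols)                  # 50 new columns, one per unique label
print(f"{zero_fraction:.0%}")  # 98% of the cells are zeros
```

With k unique labels, each row holds a single 1 and k - 1 zeros, so the share of zeros grows as the number of labels grows. Keeping the result in sparse form stores only the non-zero entries.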

#### ___Example___
Encode all categorical columns in the dataset using OHE.

In [57]:
data = sns.load_dataset('tips')
data

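| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 |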


In [58]:
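encode = OneHotEncoder()
# Note: fitting on the whole DataFrame encodes every column, including the numeric ones
encoded_data = encode.fit_transform(data).toarray()

encoded_df = pd.DataFrame(encoded_data,columns=encode.get_feature_names_out())
encoded_df.head()

# 368 columns (features) in total: one per distinct value in each column, not one per sample.
# To encode only the categorical columns, pass just those columns to fit_transform.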

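| | total_bill_3.07 | total_bill_5.75 | total_bill_7.25 | total_bill_7.51 | total_bill_7.56 | total_bill_7.74 | total_bill_8.35 | total_bill_8.51 | total_bill_8.52 | total_bill_8.58 | ... | day_Sun | day_Thur | time_Dinner | time_Lunch | size_1 | size_2 | size_3 | size_4 | size_5 | size_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |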


In [59]:
final_df = pd.concat([data,encoded_df],axis = 1)
final_df

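| | total_bill | tip | sex | smoker | day | time | size | total_bill_3.07 | total_bill_5.75 | total_bill_7.25 | ... | day_Sun | day_Thur | time_Dinner | time_Lunch | size_1 | size_2 | size_3 | size_4 | size_5 | size_6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 | 0.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 241 | 22.67 | 2.00 | Male | Yes | Sat | Dinner | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 242 | 17.82 | 1.75 | Male | No | Sat | Dinner | 2 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |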
