### Feature Encoding

#### What is Encoding?
        - Converting categorical data into numeric data
        - e.g. Sardar = 1, Shahzad = 2, Shahzeb = 3, Shahzain = 4 (label encoding)
        - The right encoder depends on whether the categories are nominal (unordered) or ordinal (ordered)

    Types of Encoding:
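        - Label Encoding: assigns each category an arbitrary integer; suited to nominal (unordered) data
        - One Hot Encoding: creates one binary column per category; each row has a single 1 (the encoded unique categories form an identity matrix)
        - Ordinal Encoding: assigns integers that follow a meaningful category order
        - Binary Encoding: categories are converted into numerical labels, then those labels are converted into binary codes
        - Frequency/Count Encoding: replaces each category with its frequency count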
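
The three encodings in the list that the notebook does not run later (one-hot, frequency/count, binary) can be sketched on a toy column. This is a minimal illustration, not the notebook's own code: the `color` series is made up, and the binary encoding is hand-rolled with plain pandas rather than taken from a dedicated library.

```python
import pandas as pd

# Hypothetical toy column, not taken from the tips dataset
s = pd.Series(["red", "green", "blue", "green", "red", "green"], name="color")

# One-hot encoding: one 0/1 column per category, exactly one 1 per row
one_hot = pd.get_dummies(s, dtype=int)

# Frequency/count encoding: replace each category with how often it occurs
freq = s.map(s.value_counts())

# Binary encoding (hand-rolled sketch): integer label -> fixed-width binary digits
labels = s.astype("category").cat.codes          # blue=0, green=1, red=2
width = max(int(labels.max()).bit_length(), 1)   # bits needed for the largest label
binary = pd.DataFrame(
    {f"color_bit{i}": (labels // 2**i) % 2 for i in reversed(range(width))}
)

print(one_hot)
print(freq)
print(binary)
```

Three categories need three one-hot columns but only two binary columns, which is why binary encoding is sometimes preferred for high-cardinality features.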

#### ---------------------------------
#### Why do we need to Encode the data?
        - Most algorithms take/prefer numerical data
        - Computers process numbers more efficiently than text
        - Text is lengthy and numbers are compact, which saves computational power
        
        - Algorithm Compatibility
            - Most machine learning algorithms require numerical input
        - Efficiency & Performance:
            - Faster computation on numeric data, storage efficiency
        - Feature Representation
            - Same numeric code for a variable in multiple languages -> universal representation
        - Support Unseen Categories
            - Some encoders can be configured to handle categories that were not seen during fitting
        - Better Memory Usage
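
Two of the points above, memory usage and unseen categories, can be checked directly. The sketch below uses a made-up column of repeated labels; the exact byte counts depend on the pandas version, so only the comparison matters. `handle_unknown="ignore"` is one way scikit-learn's `OneHotEncoder` can be told to tolerate a category it never saw during `fit`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column of repeated text labels
text_col = pd.Series(["Dinner", "Lunch"] * 5000)
codes = text_col.astype("category").cat.codes   # small integer codes

# Memory: integer codes take far less space than the original strings
text_bytes = text_col.memory_usage(deep=True)
code_bytes = codes.memory_usage(deep=True)
print(text_bytes, code_bytes)

# Unseen categories: with handle_unknown="ignore", an unknown value
# is encoded as a row of zeros instead of raising an error
ohe = OneHotEncoder(handle_unknown="ignore")
ohe.fit(pd.DataFrame({"time": ["Dinner", "Lunch"]}))
unseen = ohe.transform(pd.DataFrame({"time": ["Breakfast"]})).toarray()
print(unseen)
```

Without `handle_unknown="ignore"`, the default behaviour of `OneHotEncoder` is to raise an error on an unknown category, so support for unseen values is a configuration choice rather than something encoding gives for free.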
#### ----------------------------------

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# data load
df = sns.load_dataset('tips')
df.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4


In [3]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [4]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [5]:
df['sex'].value_counts()

sex
Male      157
Female     87
Name: count, dtype: int64

In [6]:
df['smoker'].value_counts()

smoker
No     151
Yes     93
Name: count, dtype: int64

In [7]:
# Encode the time variable with LabelEncoder

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder

le = LabelEncoder()

df['encoded_time'] = le.fit_transform(df['time'])

In [8]:
df.head()

   total_bill   tip     sex smoker  day    time  size  encoded_time
0       16.99  1.01  Female     No  Sun  Dinner     2             0
1       10.34  1.66    Male     No  Sun  Dinner     3             0
2       21.01  3.50    Male     No  Sun  Dinner     3             0
3       23.68  3.31    Male     No  Sun  Dinner     2             0
4       24.59  3.61  Female     No  Sun  Dinner     4             0


In [10]:
df['time'].value_counts()

time
Dinner    176
Lunch      68
Name: count, dtype: int64

In [11]:
df['encoded_time'].value_counts()

encoded_time
0    176
1     68
Name: count, dtype: int64
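
The matching counts (176 and 68) suggest that Dinner became 0 and Lunch became 1. A minimal sketch of how to confirm the mapping instead of inferring it from counts, using a small stand-in series rather than the full `df['time']` column. Note that scikit-learn documents `LabelEncoder` as a tool for target labels; for feature columns, `OrdinalEncoder` is the documented choice.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for df['time'] (assumed values, same two categories)
time = pd.Series(["Dinner", "Dinner", "Lunch", "Dinner", "Lunch"])

le = LabelEncoder()
encoded = le.fit_transform(time)

# classes_ holds the categories in sorted order; the position is the code
mapping = {str(label): code for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the codes
decoded = le.inverse_transform(encoded)
print(list(decoded))
```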

In [14]:
df['day'].value_counts()

day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64

In [15]:
# Ordinal-encode the days using a specific order

oe = OrdinalEncoder(categories=[['Thur', 'Fri', 'Sat', 'Sun']])
# oe = OrdinalEncoder() also works, but assigns codes in sorted (alphabetical) order: Fri, Sat, Sun, Thur

df['encoded_days'] = oe.fit_transform(df[['day']])
df.head()

   total_bill   tip     sex smoker  day    time  size  encoded_time  encoded_days
0       16.99  1.01  Female     No  Sun  Dinner     2             0           3.0
1       10.34  1.66    Male     No  Sun  Dinner     3             0           3.0
2       21.01  3.50    Male     No  Sun  Dinner     3             0           3.0
3       23.68  3.31    Male     No  Sun  Dinner     2             0           3.0
4       24.59  3.61  Female     No  Sun  Dinner     4             0           3.0


In [16]:
df['encoded_days'].value_counts()

encoded_days
2.0    87
3.0    76
0.0    62
1.0    19
Name: count, dtype: int64
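
The codes above follow the explicit order passed to `categories` (Thur=0, Fri=1, Sat=2, Sun=3). To show what changes when no order is given, here is a small sketch on a made-up `days` frame rather than the full tips data. It compares the explicit order with the default, which sorts the category names.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical sample with the same four day labels as the tips dataset
days = pd.DataFrame({"day": ["Sun", "Sat", "Thur", "Fri", "Sat"]})

explicit = OrdinalEncoder(categories=[["Thur", "Fri", "Sat", "Sun"]])
default = OrdinalEncoder()   # categories inferred and sorted alphabetically

# fit_transform returns a 2-D array; ravel() flattens it to one column
days["explicit"] = explicit.fit_transform(days[["day"]]).ravel()
days["default"] = default.fit_transform(days[["day"]]).ravel()

print(default.categories_)   # the order the default encoder settled on
print(days)
```

With the default encoder, Thur receives the highest code and Fri the lowest, which does not reflect the order of the week. Passing `categories` explicitly is what makes the encoding ordinal in a meaningful sense.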