# Machine Learning

## Feature Encoding
Feature encoding is the process of converting categorical data (such as text labels or categories) into a numerical format that machine learning algorithms can use. Most algorithms require numeric input features, so encoding is essential for handling non-numeric (categorical) variables.

The main encoders covered here are:
- Label Encoder
- One Hot Encoder
- Ordinal Encoder
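
To make the idea of "converting categories to numbers" concrete, here is a minimal pure-Python sketch of the two basic representations, integer codes and one-hot vectors. The `colors` list and the variable names are hypothetical, chosen only for illustration; the scikit-learn encoders handle many more details than this.

```python
# Toy category column (hypothetical values, not taken from a dataset).
colors = ["red", "green", "blue", "green"]

# Integer codes: give each distinct category an index (sorted for determinism).
categories = sorted(set(colors))  # ['blue', 'green', 'red']
code_of = {cat: i for i, cat in enumerate(categories)}
int_codes = [code_of[c] for c in colors]

# One-hot vectors: one 0/1 slot per category, with a single 1 marking the value.
one_hot = [[1 if cat == c else 0 for cat in categories] for c in colors]

print(int_codes)
print(one_hot)
```

Integer codes are compact but imply an order between categories, while one-hot vectors avoid that implied order at the cost of one column per category.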

In [1]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
# Loading the dataset
df = sns.load_dataset("titanic")
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [6]:
# Displaying the unique values in the 'who' column
print(df["who"].unique())
# Displaying the value counts of the 'who' column
print(df["who"].value_counts())

['man' 'woman' 'child']
who
man      537
woman    271
child     83
Name: count, dtype: int64


## Label Encoder

In [19]:
# Importing different encoders from the scikit-learn library
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
# Creating an instance of LabelEncoder
le = LabelEncoder()
# Encoding the 'who' column using LabelEncoder
df["Encoded_who"] = le.fit_transform(df["who"])
# Displaying the first few rows of the DataFrame with the new encoded column
print(df[["who", "Encoded_who"]].head(10))
print("---------------------------------")
# Checking the unique values in the original and encoded columns
columns = ['who', 'Encoded_who']
for col in columns:
    print(f"Unique values in {col}: {df[col].unique()}")
print("---------------------------------")
# Displaying the value counts of the original and encoded columns
for col in columns:
    print(f"Value counts for {df[col].value_counts()}")

     who  Encoded_who
0    man            1
1  woman            2
2  woman            2
3  woman            2
4    man            1
5    man            1
6    man            1
7  child            0
8  woman            2
9  child            0
---------------------------------
Unique values in who: ['man' 'woman' 'child']
Unique values in Encoded_who: [1 2 0]
---------------------------------
Value counts for who
man      537
woman    271
child     83
Name: count, dtype: int64
Value counts for Encoded_who
1    537
2    271
0     83
Name: count, dtype: int64
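
The output above shows `child` mapped to 0, `man` to 1, and `woman` to 2. That follows from `LabelEncoder` storing its categories in sorted order in `classes_`. The sketch below uses a small hypothetical list instead of the Titanic data to show how the mapping can be inspected and reversed. Note that the scikit-learn documentation describes `LabelEncoder` as intended for target labels; for input features, `OrdinalEncoder` is the usual counterpart.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels mirroring the categories of the 'who' column (hypothetical rows).
who = ["man", "woman", "woman", "child", "man"]

le = LabelEncoder()
codes = le.fit_transform(who)

# classes_ holds the learned categories in sorted order; the position is the code.
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(codes)
print(list(decoded))
```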


## One Hot Encoder

In [48]:
# Loading the dataset
df = sns.load_dataset("tips")
print(df.head())
print("---------------------------------")
# Displaying the unique values in the 'day' column
print(df["day"].unique())
# Displaying the value counts of the 'day' column
print(df["day"].value_counts())

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
---------------------------------
['Sun', 'Sat', 'Thur', 'Fri']
Categories (4, object): ['Thur', 'Fri', 'Sat', 'Sun']
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64


In [49]:
# Creating an instance of OneHotEncoder (dense output, keeping every category)
ohe = OneHotEncoder(sparse_output=False, drop=None)
# Encoding the 'day' column using OneHotEncoder
day_encoded = ohe.fit_transform(df[["day"]])
# Building the names of the encoded columns (day_Fri, day_Sat, ...)
encoded_cols = ohe.get_feature_names_out(["day"])
# Appending the encoded columns to the original DataFrame
df = pd.concat([df, pd.DataFrame(day_encoded, columns=encoded_cols, index=df.index)], axis=1)
print("First 5 rows of the DataFrame with OneHotEncoded columns:")
print(df.head(5))
print("Last 5 rows of the DataFrame with OneHotEncoded columns:")
print(df.tail(5))
# Displaying the unique values in the new encoded columns
print("---------------------------------")
print("Unique values in the new encoded columns:")
for col in encoded_cols:
    print(f"Unique values in {col}: {df[col].unique()}")

First 5 rows of the DataFrame with OneHotEncoded columns:
   total_bill   tip     sex smoker  day    time  size  day_Fri  day_Sat  \
0       16.99  1.01  Female     No  Sun  Dinner     2      0.0      0.0   
1       10.34  1.66    Male     No  Sun  Dinner     3      0.0      0.0   
2       21.01  3.50    Male     No  Sun  Dinner     3      0.0      0.0   
3       23.68  3.31    Male     No  Sun  Dinner     2      0.0      0.0   
4       24.59  3.61  Female     No  Sun  Dinner     4      0.0      0.0   

   day_Sun  day_Thur  
0      1.0       0.0  
1      1.0       0.0  
2      1.0       0.0  
3      1.0       0.0  
4      1.0       0.0  
Last 5 rows of the DataFrame with OneHotEncoded columns:
     total_bill   tip     sex smoker   day    time  size  day_Fri  day_Sat  \
239       29.03  5.92    Male     No   Sat  Dinner     3      0.0      1.0   
240       27.18  2.00  Female    Yes   Sat  Dinner     2      0.0      1.0   
241       22.67  2.00    Male    Yes   Sat  Dinner     2      

In [50]:
# Displaying only the newly created columns
print("---------------------------------")
print("Only newly created columns:")
print(df[encoded_cols].head(5))
# Displaying the value counts of the new encoded columns
print("---------------------------------")
for col in encoded_cols:
    print(f"Value counts for {col}:")
    print(df[col].value_counts())
    print("---------------------------------")

---------------------------------
Only newly created columns:
   day_Fri  day_Sat  day_Sun  day_Thur
0      0.0      0.0      1.0       0.0
1      0.0      0.0      1.0       0.0
2      0.0      0.0      1.0       0.0
3      0.0      0.0      1.0       0.0
4      0.0      0.0      1.0       0.0
---------------------------------
Value counts for day_Fri:
day_Fri
0.0    225
1.0     19
Name: count, dtype: int64
---------------------------------
Value counts for day_Sat:
day_Sat
0.0    157
1.0     87
Name: count, dtype: int64
---------------------------------
Value counts for day_Sun:
day_Sun
0.0    168
1.0     76
Name: count, dtype: int64
---------------------------------
Value counts for day_Thur:
day_Thur
0.0    182
1.0     62
Name: count, dtype: int64
---------------------------------


In [52]:
# For Confirmation, displaying the value counts of the original 'day' column
print("Value counts for original 'day' column:")
print(df["day"].value_counts())

Value counts for original 'day' column:
day
Sat     87
Sun     76
Thur    62
Fri     19
Name: count, dtype: int64


## Ordinal Encoder

In [53]:
# Loading the dataset
df = sns.load_dataset("titanic")
# Displaying the first few rows of the DataFrame
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  


In [54]:
# Creating an instance of OrdinalEncoder
oe = OrdinalEncoder(categories=[['First','Second','Third']])
# Encoding the 'class' column using OrdinalEncoder
df["Encoded_class"] = oe.fit_transform(df[["class"]])
# Displaying the first few rows of the DataFrame with the new encoded column
print("First 10 rows of the DataFrame with OrdinalEncoded column:")
print(df[["class", "Encoded_class"]].head(10))

First 10 rows of the DataFrame with OrdinalEncoded column:
    class  Encoded_class
0   Third            2.0
1   First            0.0
2   Third            2.0
3   First            0.0
4   Third            2.0
5   Third            2.0
6   First            0.0
7   Third            2.0
8   Third            2.0
9  Second            1.0
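
The explicit `categories` list is what makes the codes follow the intended order. Without it, `OrdinalEncoder` infers the categories and sorts them, which for strings means alphabetical order. For `First`, `Second`, `Third` the alphabetical order happens to match, so the sketch below uses a hypothetical `size` column where it does not.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy ordered feature (hypothetical values).
sizes = pd.DataFrame({"size": ["low", "high", "medium", "low"]})

# Default: categories are inferred and sorted alphabetically (high, low, medium).
default_codes = OrdinalEncoder().fit_transform(sizes).ravel()

# Explicit order: the position in the list defines the code (low=0, medium=1, high=2).
oe = OrdinalEncoder(categories=[["low", "medium", "high"]])
ordered_codes = oe.fit_transform(sizes).ravel().astype(int)

print(default_codes)
print(ordered_codes)
```

The encoder returns floats by default, as in the `Encoded_class` column above; the `.astype(int)` call here is just one way to get integer codes.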


## Using Pandas for Feature Encoding

In [62]:
# Loading the dataset
df = sns.load_dataset("titanic")
# Displaying the first few rows of the DataFrame
print(df.head())

   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  
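
Before applying pandas to the full Titanic frame, a small toy frame can show two pandas-only approaches side by side: an ordered `pd.Categorical` for ordinal codes and `pd.get_dummies` for one-hot columns. The rows and the `class_code` column name below are hypothetical, echoing the `class` and `embark_town` columns; this is a sketch, not the only way to do it.

```python
import pandas as pd

# Toy frame (hypothetical rows echoing the Titanic 'class' and 'embark_town' columns).
toy = pd.DataFrame({
    "class": ["Third", "First", "Second", "Third"],
    "embark_town": ["Southampton", "Cherbourg", "Queenstown", "Southampton"],
})

# Ordinal encoding: an ordered Categorical exposes integer codes via .codes.
ordered = pd.Categorical(toy["class"], categories=["First", "Second", "Third"], ordered=True)
toy["class_code"] = ordered.codes

# One-hot encoding: get_dummies with dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(toy, columns=["embark_town"], dtype=int)

print(toy[["class", "class_code"]])
print(dummies.filter(like="embark_town_"))
```

Unlike the scikit-learn encoders, `get_dummies` does not remember the categories it saw, so it is most convenient for exploration rather than for pipelines that must encode new data consistently.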


In [63]:
# One-hot encoding the 'embark_town' column using pandas get_dummies
dummy = pd.get_dummies(df, columns=["embark_town"], drop_first=False)
# Displaying the first few rows of the DataFrame with dummy variables
print("First 5 rows of the DataFrame with dummy variables:")
print(dummy.head(5))

First 5 rows of the DataFrame with dummy variables:
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck alive  alone  embark_town_Cherbourg  \
0    man        True  NaN    no  False                  False   
1  woman       False    C   yes  False                   True   
2  woman       False  NaN   yes   True                  False   
3  woman       False    C   yes  False                  False   
4    man        True  NaN    no   True                  False   

   embark_town_Queenstown  embark_town_Southampton  
0                   False                     T