# Handling Categorical Data (Encoding)

- Okay, so after I deal with missing values, I need to think about categorical data.
- These are columns that aren’t numbers, like Male/Female, Yes/No, or Red/Blue/Green.
<br>
<br>
>The problem is, most machine learning models can’t work with words directly…<br> They do their math on numbers.
>So I have to convert categories into numbers.


## That’s where encoding comes in:

- **Label Encoding** → I just give each category a number
> For Example:<br>Male = 0, Female = 1 <br>Simple, but sometimes risky because the model might think 1 > 0 has meaning when it actually doesn’t.

- **One-Hot Encoding** → this is like giving each category its own column.
> For Example:<br>Male → [1,0], Female → [0,1] <br>It’s safer since it doesn’t assume one category is bigger than the other.

- **Ordinal Encoding** → used when categories have a natural order.
> For Example:<br>Low = 1, Medium = 2, High = 3. <br>Unlike plain label encoding, the numbers here make sense because the order actually matters.

- **Frequency Encoding** → replaces categories with how often they appear.
> For Example:<br>If Male appears 70% of the time and Female 30%,<br>Male becomes 0.7 and Female becomes 0.3.

- And others, like Binary and Hashing Encoding...
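
Frequency encoding is the one method from this list that I don't demo later, so here is a minimal sketch of it with plain pandas. The `gender` column is made-up toy data (7 Male, 3 Female) chosen to match the 70% / 30% example above, and the variable names are just for illustration.

```python
import pandas as pd

# Hypothetical toy column: 7 "Male" rows and 3 "Female" rows
gender = pd.Series(["Male"] * 7 + ["Female"] * 3, name="Gender")

# value_counts(normalize=True) returns the share of rows per category
frequencies = gender.value_counts(normalize=True)

# Replace every category with its share of the rows
gender_freq = gender.map(frequencies)

print(frequencies.to_dict())
print(gender_freq.tolist())
```

One thing to keep in mind: two categories that happen to appear equally often end up with the same number, so the model can no longer tell them apart.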



## Below is example code demonstrating different methods of handling categorical data
👇👇👇

In [178]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder


data = {
    "Color": ["Red","Blue","Orange","Yellow"],
    "Satisfaction": ["Bad", "Good", "Very Good", "Satisfied"],
}

df = pd.DataFrame(data)

print("ORIGINAL DATA")
display(df)

ORIGINAL DATA


| | Color | Satisfaction |
|---|---|---|
| 0 | Red | Bad |
| 1 | Blue | Good |
| 2 | Orange | Very Good |
| 3 | Yellow | Satisfied |


In [179]:
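Label_Encoder = LabelEncoder()

# Store the encoded colors in a new column so the original Color column stays intact.
# Note: scikit-learn documents LabelEncoder as meant for target labels (y); it is used on a feature here for illustration.
df['new_color_label'] = Label_Encoder.fit_transform(df['Color'])
print(df)  # compare the original colors with their codes (assigned in alphabetical order)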


    Color Satisfaction  new_color_label
0     Red          Bad                2
1    Blue         Good                0
2  Orange    Very Good                1
3  Yellow    Satisfied                3


In [180]:
df = df.drop('new_color_label', axis=1)  # axis=1 drops a column; axis=0 would drop a row
print(df)

    Color Satisfaction
0     Red          Bad
1    Blue         Good
2  Orange    Very Good
3  Yellow    Satisfied


In [181]:

# To overwrite the Color column completely, assign the result back to the same column

df['Color'] = Label_Encoder.fit_transform(df['Color'])  # use the same df['Column_name'] you want to transform

print(df)



   Color Satisfaction
0      2          Bad
1      0         Good
2      1    Very Good
3      3    Satisfied
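
The codes above (Blue = 0, Orange = 1, Red = 2, Yellow = 3) hint at an order that colors don't really have, which is the risk I mentioned for label encoding. As a comparison, here is a small sketch of one-hot encoding the same four colors with `pd.get_dummies`. It works on a fresh copy of the values (the `colors` and `color_dummies` names are only for this example), so it does not touch `df`.

```python
import pandas as pd

# Fresh copy of the same Color values, so the notebook's df is left alone
colors = pd.DataFrame({"Color": ["Red", "Blue", "Orange", "Yellow"]})

# One 0/1 column per color; dtype=int gives 0/1 instead of True/False
color_dummies = pd.get_dummies(colors["Color"], prefix="Color", dtype=int)

print(color_dummies)
```

Each row has exactly one 1, so no color is treated as bigger or smaller than another.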


## NEXT IS SATISFACTION
- I'll be using Ordinal Encoding here, since satisfaction levels have an order.

In [182]:
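# List the categories first, from lowest to highest; the encoder assigns 0, 1, 2, 3 in this order.
# (The ranking used here, with "Satisfied" as the top level, is the one chosen in these notes.)
ordinal_enco = OrdinalEncoder(categories=[["Bad", "Good", "Very Good", "Satisfied"]])

# Same as before: add a new column for comparison instead of overwriting Satisfaction.
# OrdinalEncoder expects 2D input, hence the double brackets.
df['satisfaction_ordinal'] = ordinal_enco.fit_transform(df[['Satisfaction']])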


print(df)

   Color Satisfaction  satisfaction_ordinal
0      2          Bad                   0.0
1      0         Good                   1.0
2      1    Very Good                   2.0
3      3    Satisfied                   3.0


In [183]:
df = df.drop('satisfaction_ordinal', axis=1)  # drop the comparison column after confirming the result
print(df)

   Color Satisfaction
0      2          Bad
1      0         Good
2      1    Very Good
3      3    Satisfied


In [184]:
df['Satisfaction'] = ordinal_enco.fit_transform(df[['Satisfaction']])  # overwrite the Satisfaction column directly
print(df)

   Color  Satisfaction
0      2           0.0
1      0           1.0
2      1           2.0
3      3           3.0
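
Two small follow-ups to the ordinal step, sketched on a fresh copy of the data so the notebook's `df` is not changed: the codes come back as floats (0.0, 1.0, ...), and the fitted encoder can map codes back to the original labels with `inverse_transform`. The names `levels`, `ratings`, and `encoder` are only for this example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Same ranking as in the notes above, lowest to highest
levels = ["Bad", "Good", "Very Good", "Satisfied"]
ratings = pd.DataFrame({"Satisfaction": ["Bad", "Good", "Very Good", "Satisfied"]})

encoder = OrdinalEncoder(categories=[levels])
codes = encoder.fit_transform(ratings[["Satisfaction"]])  # 2D float array, one column

# Optional: flatten and store the codes as integers instead of floats
ratings["Satisfaction_Code"] = codes.ravel().astype(int)

# inverse_transform turns the codes back into the original labels
restored = encoder.inverse_transform(codes)

print(ratings)
print(restored.ravel().tolist())
```

Being able to reverse the encoding is handy when I want to read results in terms of the original categories.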


## I learned how to use different encoding methods to turn categorical data into numbers, and when to choose each one


> Below I'll be using more complex data.


In [185]:
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder


data = {
    "Student_ID": [101, 102, 103, 104, 105, 106, 107, 108],
    "Hours_Study": [2, 4, 5, 7, 8, 3, 6, 9],         # numeric
    "Hours_Sleep": [9, 8, 6, 5, 7, 8, 6, 4],         # numeric
    "Part_Time_Job": ["Yes", "No", "Yes", "No", "No", "Yes", "Yes", "No"],  # categorical
    "Internet_Access": ["Mobile", "WiFi", "WiFi", "None", "Mobile", "WiFi", "None", "Mobile"],  # categorical
    "Year_Level": ["Freshman", "Sophomore", "Junior", "Senior", "Freshman", "Junior", "Senior", "Sophomore"],  # ordinal
    "Favorite_Subject": ["Math", "Science", "English", "English", "Math", "Science", "Math", "English"],  # categorical
    "Final_Grade": [65, 70, 80, 85, 90, 72, 78, 95]   # numeric (target for later)
}

df_2 = pd.DataFrame(data)
display(df_2)

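| | Student_ID | Hours_Study | Hours_Sleep | Part_Time_Job | Internet_Access | Year_Level | Favorite_Subject | Final_Grade |
|---|---|---|---|---|---|---|---|---|
| 0 | 101 | 2 | 9 | Yes | Mobile | Freshman | Math | 65 |
| 1 | 102 | 4 | 8 | No | WiFi | Sophomore | Science | 70 |
| 2 | 103 | 5 | 6 | Yes | WiFi | Junior | English | 80 |
| 3 | 104 | 7 | 5 | No | None | Senior | English | 85 |
| 4 | 105 | 8 | 7 | No | Mobile | Freshman | Math | 90 |
| 5 | 106 | 3 | 8 | Yes | WiFi | Junior | Science | 72 |
| 6 | 107 | 6 | 6 | Yes | None | Senior | Math | 78 |
| 7 | 108 | 9 | 4 | No | Mobile | Sophomore | English | 95 |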


In [None]:
# Scale the numeric features so each has mean 0 and standard deviation 1
scaler = StandardScaler()
numeric_features = df_2[['Hours_Study', 'Hours_Sleep']]
scaled_numeric = scaler.fit_transform(numeric_features)


scaled_df = pd.DataFrame(scaled_numeric, columns=["Hours_Study_Scaled", "Hours_Sleep_Scaled"])


comparison_nume = pd.concat([df_2[['Hours_Study','Hours_Sleep']], scaled_df], axis=1)

print("Comparison of Original vs Encoded:\n", comparison_nume)
print("Scaled Numeric Features:\n", scaled_df, "\n")


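# Year_Level has a natural order, so list the levels from lowest to highest
ord_enc = OrdinalEncoder(categories=[["Freshman", "Sophomore", "Junior", "Senior"]])
year_encoded = ord_enc.fit_transform(df_2[["Year_Level"]])  # double brackets: the encoder expects 2D input
year_df = pd.DataFrame(year_encoded, columns=["Year_Level_Encoded"])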

comparison_yearlvl = pd.concat([df_2['Year_Level'], year_df], axis=1)
print("Comparison of Original vs Encoded:\n", comparison_yearlvl)


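# Internet_Access has no order, so give each category its own 0/1 column
# (sparse_output=False returns a regular array; the parameter needs scikit-learn 1.2 or newer)
ohe = OneHotEncoder(sparse_output=False)
internet_encoded = ohe.fit_transform(df_2[["Internet_Access"]])
internet_df = pd.DataFrame(internet_encoded, columns=ohe.get_feature_names_out(["Internet_Access"]))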


comparison_internet = pd.concat([df_2['Internet_Access'], internet_df], axis=1)
display(comparison_internet)


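# Part_Time_Job is binary, so a simple 0/1 mapping is enough
part_time_encoded = df_2["Part_Time_Job"].map({"No": 0, "Yes": 1})
# Favorite_Subject has no natural order, so these integer codes are arbitrary labels
# (one-hot encoding would avoid implying Math < Science < English)
subject_encoded = df_2["Favorite_Subject"].map({"Math": 0, "Science": 1, "English": 2})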

comparison_sub = pd.concat(
    [df_2['Favorite_Subject'], subject_encoded.rename("Favorite_Subject_Encoded")],
    axis=1
)
print("Comparison of Original vs Encoded:\n", comparison_sub)



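# Combine all processed features with the target column (Final_Grade);
# Student_ID is left out because it is only an identifier
processed_df = pd.concat(
    [scaled_df, internet_df, year_df, part_time_encoded, subject_encoded, df_2["Final_Grade"]],
    axis=1
)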

display(processed_df)

Comparison of Original vs Encoded:
    Hours_Study  Hours_Sleep  Hours_Study_Scaled  Hours_Sleep_Scaled
0            2            9           -1.527525            1.506798
1            4            8           -0.654654            0.872357
2            5            6           -0.218218           -0.396526
3            7            5            0.654654           -1.030967
4            8            7            1.091089            0.237915
5            3            8           -1.091089            0.872357
6            6            6            0.218218           -0.396526
7            9            4            1.527525           -1.665408
Scaled Numeric Features:
    Hours_Study_Scaled  Hours_Sleep_Scaled
0           -1.527525            1.506798
1           -0.654654            0.872357
2           -0.218218           -0.396526
3            0.654654           -1.030967
4            1.091089            0.237915
5           -1.091089            0.872357
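6            0.218218           -0.396526
7            1.527525           -1.665408 

Comparison of Original vs Encoded:
   Year_Level  Year_Level_Encoded
0    Freshman                 0.0
1   Sophomore                 1.0
2      Junior                 2.0
3      Senior                 3.0
4    Freshman                 0.0
5      Junior                 2.0
6      Senior                 3.0
7   Sophomore                 1.0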
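
To see where the scaled values above come from: `StandardScaler` computes z = (x - mean) / std for each column, using the population standard deviation (`ddof=0`). Here is a quick sketch that reproduces the `Hours_Study_Scaled` column by hand with NumPy; the array is copied from the data above and the variable names are only for this check.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hours_Study values copied from the student data above
hours_study = np.array([2, 4, 5, 7, 8, 3, 6, 9], dtype=float)

# Manual z-score: subtract the mean, divide by the population std (ddof=0)
manual = (hours_study - hours_study.mean()) / hours_study.std(ddof=0)

# The same column through StandardScaler (it expects 2D input, hence reshape)
scaled = StandardScaler().fit_transform(hours_study.reshape(-1, 1)).ravel()

print(np.round(manual, 6))
print(np.allclose(manual, scaled))
```

The mean of `Hours_Study` is 5.5, so the first student (2 hours) ends up well below zero, matching the -1.527525 in the table.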

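| | Internet_Access | Internet_Access_Mobile | Internet_Access_None | Internet_Access_WiFi |
|---|---|---|---|---|
| 0 | Mobile | 1.0 | 0.0 | 0.0 |
| 1 | WiFi | 0.0 | 0.0 | 1.0 |
| 2 | WiFi | 0.0 | 0.0 | 1.0 |
| 3 | None | 0.0 | 1.0 | 0.0 |
| 4 | Mobile | 1.0 | 0.0 | 0.0 |
| 5 | WiFi | 0.0 | 0.0 | 1.0 |
| 6 | None | 0.0 | 1.0 | 0.0 |
| 7 | Mobile | 1.0 | 0.0 | 0.0 |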


Comparison of Original vs Encoded:
   Favorite_Subject  Favorite_Subject_Encoded
0             Math                         0
1          Science                         1
2          English                         2
3          English                         2
4             Math                         0
5          Science                         1
6             Math                         0
7          English                         2


| | Hours_Study_Scaled | Hours_Sleep_Scaled | Internet_Access_Mobile | Internet_Access_None | Internet_Access_WiFi | Year_Level_Encoded | Part_Time_Job | Favorite_Subject | Final_Grade |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.527525 | 1.506798 | 1.0 | 0.0 | 0.0 | 0.0 | 1 | 0 | 65 |
| 1 | -0.654654 | 0.872357 | 0.0 | 0.0 | 1.0 | 1.0 | 0 | 1 | 70 |
| 2 | -0.218218 | -0.396526 | 0.0 | 0.0 | 1.0 | 2.0 | 1 | 2 | 80 |
| 3 | 0.654654 | -1.030967 | 0.0 | 1.0 | 0.0 | 3.0 | 0 | 2 | 85 |
| 4 | 1.091089 | 0.237915 | 1.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 90 |
| 5 | -1.091089 | 0.872357 | 0.0 | 0.0 | 1.0 | 2.0 | 1 | 1 | 72 |
| 6 | 0.218218 | -0.396526 | 0.0 | 1.0 | 0.0 | 3.0 | 1 | 0 | 78 |
| 7 | 1.527525 | -1.665408 | 1.0 | 0.0 | 0.0 | 1.0 | 0 | 2 | 95 |
