# CATEGORICAL ENCODING

### One-Hot Encoding

- One-hot encoding (OHE) is one of the most common ways to convert categorical data into numerical form, specifically into binary (0/1) variables, so that machine-learning models can process it.

    - For a categorical variable with k unique categories, OHE creates k new columns (also called dummy variables).
    - Each column represents one possible category.
    - In each row, only one of these columns will have a 1 (indicating presence of that category); all others will have 0.
    
- Advantages
    - Makes categorical data numeric, so ML models can use it.
    - Works especially well with linear models (like logistic regression or linear regression).
    - The encoded columns clearly express which category each record belongs to.

- Limitations
    - High cardinality problem:
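        - If a variable has many unique categories (e.g., 100 city names), OHE will create 100 new columns, which widens the dataset considerably and leaves most entries at 0.
    - Multicollinearity:
        - The k dummy variables are perfectly collinear: in every row they sum to 1, so any one of them is determined by the others (for instance, if “Male” = 0, it automatically means “Female” = 1). In a linear model with an intercept this is known as the dummy variable trap. To avoid it, one dummy variable is usually dropped (called “drop-first” encoding), leaving k-1 columns.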
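
The points above can be checked on a tiny example. This is a minimal sketch, not part of the original notebook: the `colors` series is made-up toy data, and the rank check assumes a linear model with an intercept column.

```python
import numpy as np
import pandas as pd

# Toy categorical variable with k = 3 categories (illustrative data only)
colors = pd.Series(["red", "green", "blue", "green", "red"], name="color")

# Full one-hot encoding: one column per category
full = pd.get_dummies(colors, dtype=int)
row_sums = full.sum(axis=1)  # every row contains exactly one 1

# Design matrix with an intercept column: intercept = blue + green + red,
# so the columns are linearly dependent
X_full = np.column_stack([np.ones(len(full)), full.to_numpy()])
rank_full = np.linalg.matrix_rank(X_full)

# Drop-first encoding keeps k-1 columns and removes the dependency
reduced = pd.get_dummies(colors, drop_first=True, dtype=int)
X_reduced = np.column_stack([np.ones(len(reduced)), reduced.to_numpy()])
rank_reduced = np.linalg.matrix_rank(X_reduced)

print("Full encoding columns:", list(full.columns))
print("Columns incl. intercept:", X_full.shape[1], "rank:", rank_full)
print("Drop-first columns incl. intercept:", X_reduced.shape[1], "rank:", rank_reduced)
```

With the full encoding the matrix has four columns but rank three; after dropping the first dummy, the number of columns and the rank agree.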

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

In [5]:

# Load the dataset

data = pd.read_csv('student_performance_updated_1000.csv')


In [6]:
print("\nDataset Shape:", data.shape)


Dataset Shape: (1000, 12)


In [7]:
data.head()

| | StudentID | Name | Gender | AttendanceRate | StudyHoursPerWeek | PreviousGrade | ExtracurricularActivities | ParentalSupport | FinalGrade | Study Hours | Attendance (%) | Online Classes Taken |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | John | Male | 85.0 | 15.0 | 78.0 | 1.0 | High | 80.0 | 4.8 | 59.0 | False |
| 1 | 2.0 | Sarah | Female | 90.0 | 20.0 | 85.0 | 2.0 | Medium | 87.0 | 2.2 | 70.0 | True |
| 2 | 3.0 | Alex | Male | 78.0 | 10.0 | 65.0 | 0.0 | Low | 68.0 | 4.6 | 92.0 | False |
| 3 | 4.0 | Michael | Male | 92.0 | 25.0 | 90.0 | 3.0 | High | 92.0 | 2.9 | 96.0 | False |
| 4 | 5.0 | Emma | Female | NaN | 18.0 | 82.0 | 2.0 | Medium | 85.0 | 4.1 | 97.0 | True |


In [8]:
data.dtypes

StudentID                    float64
Name                          object
Gender                        object
AttendanceRate               float64
StudyHoursPerWeek            float64
PreviousGrade                float64
ExtracurricularActivities    float64
ParentalSupport               object
FinalGrade                   float64
Study Hours                  float64
Attendance (%)               float64
Online Classes Taken          object
dtype: object

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  960 non-null    float64
 1   Name                       966 non-null    object 
 2   Gender                     952 non-null    object 
 3   AttendanceRate             960 non-null    float64
 4   StudyHoursPerWeek          950 non-null    float64
 5   PreviousGrade              967 non-null    float64
 6   ExtracurricularActivities  957 non-null    float64
 7   ParentalSupport            978 non-null    object 
 8   FinalGrade                 960 non-null    float64
 9   Study Hours                976 non-null    float64
 10  Attendance (%)             959 non-null    float64
 11  Online Classes Taken       975 non-null    object 
dtypes: float64(8), object(4)
memory usage: 93.9+ KB


In [10]:
categorical_columns = data.select_dtypes(include=['object']).columns.tolist()

In [11]:
# Check unique values for each categorical column
# (nunique() ignores NaN, while unique() lists it as a value)
for col in categorical_columns:
    print(f"\n{col}:")
    print(f"  Unique values: {data[col].nunique()}")
    print(f"  Values: {data[col].unique()[:10]}")


Name:
  Unique values: 962
  Values: ['John' 'Sarah' 'Alex' 'Michael' 'Emma' 'Olivia' 'Daniel' 'Sophia' 'James'
 'Isabella']

Gender:
  Unique values: 2
  Values: ['Male' 'Female' nan]

ParentalSupport:
  Unique values: 3
  Values: ['High' 'Medium' 'Low' nan]

Online Classes Taken:
  Unique values: 2
  Values: [False True nan]
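
The output above shows why `Name` is a poor candidate for one-hot encoding: with 962 distinct values it would add almost one column per row. A minimal sketch of screening columns by cardinality before encoding follows. The `toy` frame and the `MAX_CATEGORIES` threshold are assumptions made for illustration, not values from the notebook.

```python
import pandas as pd

# Toy stand-in for the student data (illustrative values only)
toy = pd.DataFrame({
    "Name": ["John", "Sarah", "Alex", "Michael", "Emma", "Olivia"],
    "Gender": ["Male", "Female", "Male", "Male", "Female", None],
    "ParentalSupport": ["High", "Medium", "Low", "High", "Medium", "Low"],
})

MAX_CATEGORIES = 5  # hypothetical cut-off; choose it to suit the dataset

# Number of distinct non-missing values per column
cardinality = toy.nunique()

# Columns small enough to one-hot encode, and columns to handle differently
low_card_cols = cardinality[cardinality <= MAX_CATEGORIES].index.tolist()
high_card_cols = cardinality[cardinality > MAX_CATEGORIES].index.tolist()

print("One-hot candidates:", low_card_cols)
print("High cardinality:", high_card_cols)
```

Identifier-like columns such as `Name` usually end up in the second list and are dropped or encoded some other way.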


In [15]:
# ============================================================================
# METHOD 1: PANDAS GET_DUMMIES (EASIEST APPROACH)
# ============================================================================

data1 = data.copy()

# Example 1: Encoding 'Gender' variable - CORRECT WAY
print("\n--- Example 1: Encoding 'Gender' with 1 and 0 ---")
print("\nOriginal Gender column (first 10 rows):")
print(data1['Gender'].head(10))

# METHOD A: Using dtype parameter in get_dummies()
print("\n✓ CORRECT: Using dtype=int parameter:")
gender_dummies_int = pd.get_dummies(data1['Gender'], dtype=int)
print(gender_dummies_int.head(10))
print(f"\nData type: {gender_dummies_int.dtypes.iloc[0]}")  # .iloc for positional access

# Show comparison with original
print("\n✓ Side-by-side comparison:")
gender_comparison = pd.concat([data1['Gender'], gender_dummies_int], axis=1)
print(gender_comparison.head(10))

# Getting k-1 dummy variables with dtype=int
print("\n✓ k-1 dummy variables (drop_first=True) with dtype=int:")
gender_dummies_k_minus_1 = pd.get_dummies(data1['Gender'], drop_first=True, dtype=int)
print(gender_dummies_k_minus_1.head(10))
print(f"Data type: {gender_dummies_k_minus_1.dtypes.iloc[0]}")  # .iloc for positional access

# Example 2: Encoding 'ParentalSupport' variable
print("\n--- Example 2: Encoding 'ParentalSupport' with 1 and 0 ---")
print("\nOriginal ParentalSupport column (first 10 rows):")
print(data1['ParentalSupport'].head(10))

print("\nUnique values in ParentalSupport:")
print(data1['ParentalSupport'].unique())

print("\n✓ All dummy variables with dtype=int:")
support_dummies = pd.get_dummies(data1['ParentalSupport'], dtype=int)
print(support_dummies.head(10))

print("\n✓ k-1 dummy variables with dtype=int:")
support_dummies_k_minus_1 = pd.get_dummies(data1['ParentalSupport'], drop_first=True, dtype=int)
print(support_dummies_k_minus_1.head(10))

# Example 3: Encoding 'Online Classes Taken'
print("\n--- Example 3: Encoding 'Online Classes Taken' with 1 and 0 ---")
print("\nOriginal 'Online Classes Taken' column (first 10 rows):")
print(data1['Online Classes Taken'].head(10))

print("\n✓ One-Hot Encoding with k-1 dummy variables:")
online_dummies = pd.get_dummies(data1['Online Classes Taken'], drop_first=True, dtype=int)
print(online_dummies.head(10))



--- Example 1: Encoding 'Gender' with 1 and 0 ---

Original Gender column (first 10 rows):
0      Male
1    Female
2      Male
3      Male
4    Female
5    Female
6      Male
7    Female
8      Male
9    Female
Name: Gender, dtype: object

✓ CORRECT: Using dtype=int parameter:
   Female  Male
0       0     1
1       1     0
2       0     1
3       0     1
4       1     0
5       1     0
6       0     1
7       1     0
8       0     1
9       1     0

Data type: int64

✓ Side-by-side comparison:
   Gender  Female  Male
0    Male       0     1
1  Female       1     0
2    Male       0     1
3    Male       0     1
4  Female       1     0
5  Female       1     0
6    Male       0     1
7  Female       1     0
8    Male       0     1
9  Female       1     0

✓ k-1 dummy variables (drop_first=True) with dtype=int:
   Male
0     1
1     0
2     1
3     1
4     0
5     0
6     1
7     0
8     1
9     0
Data type: int64

--- Example 2: Encoding 'ParentalSupport' with 1 and 0 ---

Original Par

In [16]:
online_dummies.head()

| | True |
|---|---|
| 0 | 0 |
| 1 | 1 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
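
`Gender`, `ParentalSupport` and `Online Classes Taken` all contain missing values, and `pd.get_dummies()` by default encodes a missing value as a row of zeros. With `drop_first=True` that row becomes indistinguishable from the dropped baseline category. The sketch below uses a small made-up series to show the effect and one possible remedy, the `dummy_na=True` option. Whether a separate missing-value indicator is appropriate depends on the analysis.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (illustrative data only)
gender = pd.Series(["Male", "Female", np.nan, "Male"], name="Gender")

# Default behaviour: the NaN row gets 0 in every dummy column
default = pd.get_dummies(gender, dtype=int)

# drop_first=True: the NaN row now looks identical to the baseline 'Female'
dropped = pd.get_dummies(gender, drop_first=True, dtype=int)

# dummy_na=True adds an explicit indicator column for missing values
with_na = pd.get_dummies(gender, dummy_na=True, dtype=int)

print(default)
print(dropped)
print(with_na)
```

Imputing the missing categories before encoding is another option; the choice is left open here.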


In [17]:
# ============================================================================
# ENCODING ALL CATEGORICAL VARIABLES WITH dtype=int
# ============================================================================
print("\n--- Encoding ALL Categorical Variables with dtype=int ---")

# Select columns (excluding identifiers)
columns_to_keep = [col for col in data1.columns if col not in ['StudentID', 'Name']]
data_for_encoding = data1[columns_to_keep].copy()

print("\n✓ Applying get_dummies() with dtype=int:")
data_encoded_correct = pd.get_dummies(data_for_encoding, drop_first=True, dtype=int)

print(f"\nOriginal shape: {data_for_encoding.shape}")
print(f"Encoded shape: {data_encoded_correct.shape}")

print("\n✓ First 10 rows of encoded dataset:")
print(data_encoded_correct.head(10))

print("\n✓ Verify data types are integers:")
encoded_prefixes = ('Gender_', 'ParentalSupport_', 'Online Classes Taken_')
categorical_encoded_cols = [col for col in data_encoded_correct.columns
                            if col.startswith(encoded_prefixes)]
print("\nData types of encoded columns:")
for col in categorical_encoded_cols:
    print(f"  {col}: {data_encoded_correct[col].dtype}")


--- Encoding ALL Categorical Variables with dtype=int ---

✓ Applying get_dummies() with dtype=int:

Original shape: (1000, 10)
Encoded shape: (1000, 11)

✓ First 10 rows of encoded dataset:
   AttendanceRate  StudyHoursPerWeek  PreviousGrade  \
0            85.0               15.0           78.0   
1            90.0               20.0           85.0   
2            78.0               10.0           65.0   
3            92.0               25.0           90.0   
4             NaN               18.0           82.0   
5            95.0               30.0           88.0   
6            70.0                8.0           60.0   
7             NaN               17.0           77.0   
8            82.0               12.0           70.0   
9            91.0               22.0           86.0   

   ExtracurricularActivities  FinalGrade  Study Hours  Attendance (%)  \
0                        1.0        80.0          4.8            59.0   
1                        2.0        87.0          2.2   

In [18]:
data_encoded_correct.head()

| | AttendanceRate | StudyHoursPerWeek | PreviousGrade | ExtracurricularActivities | FinalGrade | Study Hours | Attendance (%) | Gender_Male | ParentalSupport_Low | ParentalSupport_Medium | Online Classes Taken_True |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 85.0 | 15.0 | 78.0 | 1.0 | 80.0 | 4.8 | 59.0 | 1 | 0 | 0 | 0 |
| 1 | 90.0 | 20.0 | 85.0 | 2.0 | 87.0 | 2.2 | 70.0 | 0 | 0 | 1 | 1 |
| 2 | 78.0 | 10.0 | 65.0 | 0.0 | 68.0 | 4.6 | 92.0 | 1 | 1 | 0 | 0 |
| 3 | 92.0 | 25.0 | 90.0 | 3.0 | 92.0 | 2.9 | 96.0 | 1 | 0 | 0 | 0 |
| 4 | NaN | 18.0 | 82.0 | 2.0 | 85.0 | 4.1 | 97.0 | 0 | 0 | 1 | 1 |
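
One practical caveat with `pd.get_dummies()` is that the output columns depend on which categories happen to be present in the data being encoded. If a later batch lacks a category, the encoded columns no longer match those the model was trained on. The sketch below shows one way to realign them with `DataFrame.reindex`. The `train` and `new` frames are made-up examples, and this is only one possible approach.

```python
import pandas as pd

# Toy training data and a later batch that contains only one category
train = pd.DataFrame({"ParentalSupport": ["High", "Low", "Medium", "High"]})
new = pd.DataFrame({"ParentalSupport": ["Low", "Low"]})

# Encoding used for training: baseline 'High' is dropped
train_enc = pd.get_dummies(train, drop_first=True, dtype=int)

# Naive encoding of the new batch: columns differ from the training columns
new_enc_naive = pd.get_dummies(new, drop_first=True, dtype=int)

# Encode without drop_first, then align to the training columns;
# categories absent from the new batch are filled with 0
new_enc = pd.get_dummies(new, dtype=int).reindex(
    columns=train_enc.columns, fill_value=0
)

print("Training columns:", list(train_enc.columns))
print("Naive new columns:", list(new_enc_naive.columns))
print(new_enc)
```

After realignment the new batch has exactly the training columns, in the same order.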
