## Data Preprocessing

This section prepares the dataset for model training.

### Objectives of Preprocessing

1. Remove unused features
2. Encode categorical features
3. Feature engineering (optional; listed because the fields in this dataset show little interaction, so engineered features could help improve accuracy)

#### Loading The Dataset


In [10]:
# Importing the modules
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

In [2]:
df = pd.read_csv("../data/raw/student_performance_dataset.csv")
df.head()

Unnamed: 0,Student_ID,Gender,Study_Hours_per_Week,Attendance_Rate,Past_Exam_Scores,Parental_Education_Level,Internet_Access_at_Home,Extracurricular_Activities,Final_Exam_Score,Pass_Fail
0,S147,Male,31,68.267841,86,High School,Yes,Yes,63,Pass
1,S136,Male,16,78.222927,73,PhD,No,No,50,Fail
2,S209,Female,21,87.525096,74,PhD,Yes,No,55,Fail
3,S458,Female,27,92.076483,99,Bachelors,No,No,65,Pass
4,S078,Female,37,98.655517,63,Masters,No,Yes,70,Pass


In [4]:
df.nunique()
df['Internet_Access_at_Home'].value_counts()

Internet_Access_at_Home
No     381
Yes    327
Name: count, dtype: int64
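
A notebook cell displays only its last expression, so the result of `df.nunique()` in the cell above is not shown. The sketch below prints both results explicitly. It uses three columns from the five rows of `df.head()` as stand-in data (an assumption; the full CSV is not reproduced here), so the counts differ from the real output.

```python
import pandas as pd

# Stand-in data: three columns from the five rows shown by df.head() above,
# not the full dataset.
sample = pd.DataFrame({
    "Student_ID": ["S147", "S136", "S209", "S458", "S078"],
    "Gender": ["Male", "Male", "Female", "Female", "Female"],
    "Internet_Access_at_Home": ["Yes", "No", "Yes", "No", "No"],
})

# Number of distinct values per column; an ID-like column has one per row.
unique_counts = sample.nunique()
print(unique_counts)

# Frequency of each category in a single column.
internet_counts = sample["Internet_Access_at_Home"].value_counts()
print(internet_counts)
```

A column whose distinct-value count equals the row count, such as `Student_ID` here, behaves like an identifier rather than a feature.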

### 1. Removing Non-Relevant Features

This decision is based on the exploratory data analysis (EDA) performed earlier. For details, refer to the EDA & Datacleaning notebook, which contains the supporting graphs and write-ups.

In [5]:
new_df = df.drop(columns=['Student_ID', 'Gender', 'Parental_Education_Level', 'Internet_Access_at_Home'])
new_df.head()

Unnamed: 0,Study_Hours_per_Week,Attendance_Rate,Past_Exam_Scores,Extracurricular_Activities,Final_Exam_Score,Pass_Fail
0,31,68.267841,86,Yes,63,Pass
1,16,78.222927,73,No,50,Fail
2,21,87.525096,74,No,55,Fail
3,27,92.076483,99,No,65,Pass
4,37,98.655517,63,Yes,70,Pass
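
`DataFrame.drop` returns a new frame and leaves the original untouched, which is why the cell above assigns the result to `new_df`. A minimal sketch of that behaviour, using a few rows and columns from the output above as stand-in data (an assumption, not the full dataset):

```python
import pandas as pd

# Stand-in data: a subset of the rows and columns shown above.
df = pd.DataFrame({
    "Student_ID": ["S147", "S136", "S209"],
    "Gender": ["Male", "Male", "Female"],
    "Study_Hours_per_Week": [31, 16, 21],
    "Parental_Education_Level": ["High School", "PhD", "PhD"],
    "Internet_Access_at_Home": ["Yes", "No", "Yes"],
    "Pass_Fail": ["Pass", "Fail", "Fail"],
})

cols_to_drop = [
    "Student_ID",
    "Gender",
    "Parental_Education_Level",
    "Internet_Access_at_Home",
]

# drop() is not in-place by default: df keeps all six columns.
new_df = df.drop(columns=cols_to_drop)

print(df.shape)
print(new_df.columns.tolist())
```

By default `drop` raises a `KeyError` if a listed column is missing, which catches misspelled names early. Passing `errors="ignore"` suppresses that check.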


### 2. Encoding Categorical Columns

In [None]:
new_df = pd.get_dummies(new_df, columns=['Extracurricular_Activities'], drop_first=True)
new_df.head()

Unnamed: 0,Study_Hours_per_Week,Attendance_Rate,Past_Exam_Scores,Final_Exam_Score,Pass_Fail,Extracurricular_Activities_Yes
0,31,68.267841,86,63,Pass,True
1,16,78.222927,73,50,Fail,False
2,21,87.525096,74,55,Fail,False
3,27,92.076483,99,65,Pass,False
4,37,98.655517,63,70,Pass,True
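
With `drop_first=True`, the two-valued `Extracurricular_Activities` column collapses to a single `Extracurricular_Activities_Yes` flag. The sketch below repeats that step on stand-in rows and requests integer output through the `dtype` parameter. Mapping `Pass_Fail` to 0/1 is a hypothetical extra step that is not part of the original notebook, shown only because the target is still text after encoding.

```python
import pandas as pd

# Stand-in data: a subset of the columns from the output above.
new_df = pd.DataFrame({
    "Study_Hours_per_Week": [31, 16, 21, 27, 37],
    "Extracurricular_Activities": ["Yes", "No", "No", "No", "Yes"],
    "Pass_Fail": ["Pass", "Fail", "Fail", "Pass", "Pass"],
})

# One-hot encode; drop_first=True removes the "No" column, leaving one flag.
# dtype=int yields 0/1 instead of False/True.
encoded = pd.get_dummies(
    new_df,
    columns=["Extracurricular_Activities"],
    drop_first=True,
    dtype=int,
)

# Hypothetical extra step (not in the original notebook): encode the target.
encoded["Pass_Fail"] = encoded["Pass_Fail"].map({"Fail": 0, "Pass": 1})

print(encoded)
```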


In [13]:
new_df.to_csv("../data/processed/student_performance_processed.csv", index=False)
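
The `index=False` argument keeps the row index out of the saved file. The sketch below shows what it changes, using an in-memory buffer instead of the real path and a small stand-in frame (both assumptions). The `round_trip` helper is hypothetical, written only for this illustration.

```python
import io

import pandas as pd

# Stand-in for the processed frame.
processed = pd.DataFrame({
    "Study_Hours_per_Week": [31, 16, 21],
    "Pass_Fail": ["Pass", "Fail", "Fail"],
    "Extracurricular_Activities_Yes": [True, False, False],
})


def round_trip(frame, index):
    """Write a frame to CSV text in memory and read it back."""
    buf = io.StringIO()
    frame.to_csv(buf, index=index)
    buf.seek(0)
    return pd.read_csv(buf)


without_index = round_trip(processed, index=False)
with_index = round_trip(processed, index=True)

print(without_index.columns.tolist())
# With the index written, read_csv names the header-less first column
# "Unnamed: 0".
print(with_index.columns.tolist())
```

Whether the `Unnamed: 0` header in the outputs above comes from such a saved index is not confirmed by the source. If it does, reading the raw file with `index_col=0` would be one way to avoid carrying it as a column.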


# 📚 KEY TAKEAWAYS – Data Preprocessing

## 🎯 What Did We Do?

### 1️⃣ Removal of Non-Relevant Columns

**Problem:**  
Certain columns did not contribute meaningful predictive value (e.g., identifiers or statistically insignificant features).

**Solution:**  
Dropped `Student_ID`, `Gender`, `Parental_Education_Level`, and `Internet_Access_at_Home` from the dataset.

**Why:**  
- Prevents noise in model training  
- Reduces dimensionality  
- Improves generalization  
- Avoids accidental data leakage  

---

### 2️⃣ Categorical Encoding

**Problem:**  
Machine learning models cannot interpret categorical text variables directly.

**Solution:**  
Applied one-hot encoding to `Extracurricular_Activities` using `pd.get_dummies()` with `drop_first=True`.

**Why:**  
- Converts categorical variables into numerical format  
- Avoids introducing ordinal bias  
- Ensures compatibility with both linear and tree-based models  
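
The ordinal-bias point is easiest to see on a column with more than two categories. The sketch below uses the `Parental_Education_Level` values from the first five rows purely as an illustration (that column was dropped in this notebook). It contrasts integer category codes, which impose an arbitrary order, with one-hot columns, which do not.

```python
import pandas as pd

# Illustration only: values taken from the first five rows of the raw data.
education = pd.Series(
    ["High School", "PhD", "PhD", "Bachelors", "Masters"],
    name="Parental_Education_Level",
)

# Integer codes follow alphabetical category order, so "High School" (1)
# ends up ranked above "Bachelors" (0): an order a model may treat as real.
label_codes = education.astype("category").cat.codes

# One-hot encoding gives each category its own 0/1 column, with no ranking.
one_hot = pd.get_dummies(education, dtype=int)

print(label_codes.tolist())
print(one_hot)
```

For a two-valued column such as `Extracurricular_Activities`, a single 0/1 flag carries no ordering problem, which is why `drop_first=True` is a reasonable choice there.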

---

## 🧠 Why These Methods?

✅ Simplified feature space  
✅ Improved model compatibility  
✅ Reduced unnecessary complexity  
✅ Maintained data integrity  
✅ Created a reusable, model-agnostic processed dataset  

---

## 📊 Results

✅ Clean feature set  
✅ No irrelevant columns  
✅ Categorical feature `Extracurricular_Activities` encoded (the `Pass_Fail` target is still stored as text)  
✅ Dataset ready for train-test split and modeling  
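
As a rough idea of the next step, the sketch below separates features from the `Pass_Fail` target and makes an 80/20 split with pandas alone. The stand-in rows, the choice of feature columns, the split ratio, and the seed are all assumptions. The modeling notebook may well use a different approach, such as scikit-learn's `train_test_split`.

```python
import pandas as pd

# Stand-in for the processed dataset (five rows, subset of columns).
processed = pd.DataFrame({
    "Study_Hours_per_Week": [31, 16, 21, 27, 37],
    "Past_Exam_Scores": [86, 73, 74, 99, 63],
    "Extracurricular_Activities_Yes": [True, False, False, False, True],
    "Pass_Fail": ["Pass", "Fail", "Fail", "Pass", "Pass"],
})

# Separate the features from the target column.
X = processed.drop(columns=["Pass_Fail"])
y = processed["Pass_Fail"]

# Sample 80% of the rows for training; the remaining rows form the test set.
X_train = X.sample(frac=0.8, random_state=42)
X_test = X.drop(X_train.index)
y_train = y.loc[X_train.index]
y_test = y.loc[X_test.index]

print(len(X_train), len(X_test))
```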

---

## 🎓 Summary

The preprocessing phase focused on simplifying the dataset by removing non-informative features and encoding categorical variables, ensuring the data is clean, structured, and ready for machine learning model development.