# Student Success – Data Preprocessing and Cleaning Challenge


**Objective:** Apply practical data cleaning and preprocessing techniques to the *StudentsPerformance.csv* dataset to prepare it for ML workflows.

**Dataset:** [Kaggle: Student Performance Dataset](https://www.kaggle.com/datasets/spscientist/students-performance-in-exams)


In [None]:

import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler


## 1. Load Dataset

In [None]:

# Load dataset
df = pd.read_csv("StudentsPerformance.csv")
print("Dataset shape:", df.shape)
df.head()


## 2. Drop Unnecessary Columns

In [None]:

# Check columns
df.info()

# Drop identifier columns if present (example: 'ID'); errors="ignore" skips missing names
df = df.drop(columns=["ID"], errors="ignore")


## 3. Handle Missing Values

In [None]:

# Check for missing values
print(df.isnull().sum())

# Drop every row that contains at least one missing value
# (alternative: impute with df.fillna(...) if too many rows would be lost)
df = df.dropna()
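
If dropping rows would discard too much data, filling the gaps is the usual alternative. The sketch below is only an illustration on a made-up toy frame (the values are not taken from the dataset): it fills a numeric column with its median and a categorical column with its most frequent value. Whether median and mode are the right choices depends on the data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented for illustration.
toy = pd.DataFrame({
    "math score": [70.0, np.nan, 90.0, 50.0],
    "lunch": ["standard", "free/reduced", None, "standard"],
})

filled = toy.copy()

# Numeric column: replace NaN with the column median (NaN is ignored when computing it).
filled["math score"] = filled["math score"].fillna(filled["math score"].median())

# Categorical column: replace missing entries with the most frequent value.
filled["lunch"] = filled["lunch"].fillna(filled["lunch"].mode().iloc[0])

print(filled)
```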


## 4. Encode Categorical Variables

In [None]:

# Apply one-hot encoding
df_encoded = pd.get_dummies(
    df,
    columns=[
        "gender",
        "race/ethnicity",
        "parental level of education",
        "lunch",
        "test preparation course",
    ],
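    drop_first=True,  # drop one level per feature so the dummy columns are not redundant
    dtype=int,        # emit 0/1 instead of booleans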
)

df_encoded.head()
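
To see what `drop_first=True` does, here is a small sketch on a toy column (the rows are made up, only the column name matches the dataset). With three category levels, full encoding yields three columns and `drop_first=True` yields two; the dropped level is represented by a row of zeros.

```python
import pandas as pd

# Toy column with three levels; the rows are invented for illustration.
toy = pd.DataFrame({"race/ethnicity": ["group A", "group B", "group C", "group A"]})

# One column per level.
full = pd.get_dummies(toy, columns=["race/ethnicity"], dtype=int)

# One column fewer: the first level ("group A") becomes the all-zeros baseline.
reduced = pd.get_dummies(toy, columns=["race/ethnicity"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```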


## 5. Normalize Numerical Columns

In [None]:

# Rescale each score column to the [0, 1] range: (x - min) / (max - min)
scaler = MinMaxScaler()
num_cols = ["math score", "reading score", "writing score"]
df_encoded[num_cols] = scaler.fit_transform(df_encoded[num_cols])

df_encoded.head()
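
As a sanity check on what `MinMaxScaler` computes with its default settings, the sketch below compares its output with the formula `(x - min) / (max - min)` applied by hand. The scores are made-up values, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up scores in a single column (scikit-learn expects a 2-D array).
scores = np.array([[40.0], [55.0], [70.0], [100.0]])

# Scaling with scikit-learn (default feature_range is (0, 1)).
scaled = MinMaxScaler().fit_transform(scores)

# The same rescaling written out by hand, column by column.
col_min = scores.min(axis=0)
col_max = scores.max(axis=0)
manual = (scores - col_min) / (col_max - col_min)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

In a workflow with a train/test split, the scaler is usually fitted on the training rows only and then applied to the test rows with `transform`, so that test data does not influence the scaling parameters.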


## 6. Export Cleaned Dataset

In [None]:

df_encoded.to_csv("students_cleaned.csv", index=False)
print("Cleaned dataset saved as students_cleaned.csv")


## 7. Reflection


- I used **one-hot encoding** because it avoids imposing ordinal relationships on categorical variables.  
- **Normalization** puts the score columns on the same [0, 1] scale as the encoded columns, which helps gradient-descent-based models converge faster and keeps features with larger ranges from dominating the result.
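- The most challenging step was **encoding categorical data**: one-hot encoding adds a column per category level, so it can inflate the feature count, and keeping every level makes the dummy columns perfectly collinear, which is why the encoding step uses `drop_first=True`.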
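
The multicollinearity point can be checked numerically. The following sketch uses a made-up toy column and assumes a model with an intercept term: when every dummy level is kept, the dummy columns sum to the intercept column, so the design matrix loses rank. Dropping one level restores full column rank.

```python
import numpy as np
import pandas as pd

# Toy column with two levels; the rows are invented for illustration.
toy = pd.DataFrame({"lunch": ["standard", "free/reduced", "standard", "free/reduced"]})
intercept = np.ones((len(toy), 1))

# All levels kept: intercept + 2 dummy columns = 3 columns.
full = pd.get_dummies(toy, columns=["lunch"], dtype=float).to_numpy()
X_full = np.hstack([intercept, full])

# One level dropped: intercept + 1 dummy column = 2 columns.
reduced = pd.get_dummies(toy, columns=["lunch"], drop_first=True, dtype=float).to_numpy()
X_reduced = np.hstack([intercept, reduced])

rank_full = np.linalg.matrix_rank(X_full)
rank_reduced = np.linalg.matrix_rank(X_reduced)

print(X_full.shape[1], rank_full)        # more columns than rank: redundant column
print(X_reduced.shape[1], rank_reduced)  # columns equal rank: no redundancy
```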
