# Student Performance Prediction – Data Preprocessing & Feature Engineering


## 1. Problem Statement

The objective of this project is to preprocess and engineer features from
student academic and demographic data to prepare it for predictive modeling.
The focus is on cleaning the data, encoding categorical variables, scaling
numerical features, creating derived features, and applying dimensionality
reduction techniques such as PCA and t-SNE. The final cleaned dataset will be
exported for use in a subsequent model-building task.


## 2. Data Collection

The dataset used in this project was downloaded from Kaggle. It contains
student demographic information and academic scores: gender, race/ethnicity,
parental level of education, lunch type, test preparation course status, and
exam scores in math, reading, and writing.


In [1]:
import pandas as pd

# Load the Kaggle CSV and preview the first five rows
df = pd.read_csv("StudentsPerformance.csv")
print(df.head())


   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  
0                    none          72             72             74  
1               completed          69             90             88  
2                    none          90             95             93  
3                    none          47             57             44  
4                    none          76             78             75  
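
Before engineering features, it can help to confirm that the loaded frame has the expected columns, no missing values, and no duplicated rows. The following is a minimal sketch that runs on an inline three-row sample copied from the preview above rather than on the full file. The 0 to 100 score range and the names `sample`, `expected_cols`, and `scores_in_range` are assumptions made for illustration and are not part of the original notebook.

```python
import io
import pandas as pd

# Three rows copied from the preview above, inlined so the sketch runs without the CSV file
csv_text = """gender,race/ethnicity,parental level of education,lunch,test preparation course,math score,reading score,writing score
female,group B,bachelor's degree,standard,none,72,72,74
female,group C,some college,standard,completed,69,90,88
male,group A,associate's degree,free/reduced,none,47,57,44
"""
sample = pd.read_csv(io.StringIO(csv_text))

expected_cols = {
    "gender", "race/ethnicity", "parental level of education", "lunch",
    "test preparation course", "math score", "reading score", "writing score",
}
score_cols = ["math score", "reading score", "writing score"]

# Columns we expect but did not find in the file
missing_cols = expected_cols - set(sample.columns)

# Missing values per column and number of fully duplicated rows
na_counts = sample.isna().sum()
n_duplicates = int(sample.duplicated().sum())

# Assumption: exam scores lie between 0 and 100
scores_in_range = bool(
    ((sample[score_cols] >= 0) & (sample[score_cols] <= 100)).all().all()
)

print("Missing columns:", missing_cols)
print("Missing values per column:")
print(na_counts)
print("Duplicated rows:", n_duplicates)
print("Scores within 0-100:", scores_in_range)
```

The same checks can be run on the full `df` by replacing `sample` with it.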


In [2]:
# Row-wise mean of the three exam scores
df["average_score"] = df[["math score", "reading score", "writing score"]].mean(axis=1)
print(df[["math score", "reading score", "writing score", "average_score"]].head())


   math score  reading score  writing score  average_score
0          72             72             74      72.666667
1          69             90             88      82.333333
2          90             95             93      92.666667
3          47             57             44      49.333333
4          76             78             75      76.333333


In [3]:
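# Ordinal encoding: a higher number means a higher parental education level
education_map = {
    "some high school": 1,
    "high school": 2,
    "some college": 3,
    "associate's degree": 4,
    "bachelor's degree": 5,
    "master's degree": 6,
}

# Series.map returns NaN for any label that is not a key in education_map
df["parent_edu_score"] = df["parental level of education"].map(education_map)

print(df[["parental level of education", "parent_edu_score"]].head())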


  parental level of education  parent_edu_score
0           bachelor's degree                 5
1                some college                 3
2             master's degree                 6
3          associate's degree                 4
4                some college                 3


## 4. Feature Engineering

- Created an `average_score` feature as the row-wise mean of the math, reading, and writing scores.
- Mapped parental education levels to an ordinal feature, `parent_edu_score`, ranging from
  1 (some high school) to 6 (master's degree), so that the ordering of education levels is
  available to a model.
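
One detail worth checking with dictionary-based encoding is that `Series.map` silently produces `NaN` for any label that is not a key in the dictionary. The sketch below reproduces both engineered features on a small toy frame and lists any unmapped labels. The toy values, including the `"doctorate"` label, are hypothetical and chosen only to trigger the unmapped case; they do not come from the dataset.

```python
import pandas as pd

# Toy frame standing in for the real data; values are illustrative only
toy = pd.DataFrame({
    "parental level of education": ["high school", "master's degree", "doctorate"],
    "math score": [60, 90, 75],
    "reading score": [70, 95, 80],
    "writing score": [65, 93, 85],
})

education_map = {
    "some high school": 1,
    "high school": 2,
    "some college": 3,
    "associate's degree": 4,
    "bachelor's degree": 5,
    "master's degree": 6,
}

# Derived feature 1: row-wise mean of the three exam scores
score_cols = ["math score", "reading score", "writing score"]
toy["average_score"] = toy[score_cols].mean(axis=1)

# Derived feature 2: ordinal education score (NaN where the label is not in the map)
toy["parent_edu_score"] = toy["parental level of education"].map(education_map)

# Collect labels that were not mapped, so they can be reviewed before modeling
unmapped = (
    toy.loc[toy["parent_edu_score"].isna(), "parental level of education"]
    .unique()
    .tolist()
)
print("Unmapped education labels:", unmapped)
print(toy[["average_score", "parent_edu_score"]])
```

If `unmapped` is non-empty on the real data, the dictionary needs another entry or the affected rows need explicit handling.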


In [4]:
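# Separate the input features from the target.
# Note: X still contains the three exam scores that average_score is computed from.
X = df.drop(columns=["average_score"])
y = df["average_score"]

print(X.head())
print("\nTarget preview:")
print(y.head())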


   gender race/ethnicity parental level of education         lunch  \
0  female        group B           bachelor's degree      standard   
1  female        group C                some college      standard   
2  female        group B             master's degree      standard   
3    male        group A          associate's degree  free/reduced   
4    male        group C                some college      standard   

  test preparation course  math score  reading score  writing score  \
0                    none          72             72             74   
1               completed          69             90             88   
2                    none          90             95             93   
3                    none          47             57             44   
4                    none          76             78             75   

   parent_edu_score  
0                 5  
1                 3  
2                 6  
3                 4  
4                 3  

Target preview:
0    72.666667
1    82.333333
2    92.666667
3    49.333333
4    76.333333
Name: average_score, dtype: float64

In [5]:
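# Group feature columns by dtype: text (object) columns are treated as categorical,
# all remaining columns as numerical
categorical_cols = X.select_dtypes(include="object").columns
numerical_cols = X.select_dtypes(exclude="object").columns

print("Categorical columns:")
print(categorical_cols)

print("\nNumerical columns:")
print(numerical_cols)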


Categorical columns:
Index(['gender', 'race/ethnicity', 'parental level of education', 'lunch',
       'test preparation course'],
      dtype='object')

Numerical columns:
Index(['math score', 'reading score', 'writing score', 'parent_edu_score'], dtype='object')


## 5. Feature and Target Separation

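- The dataset was split into input features (`X`) and the target variable (`y`, the
  `average_score` column).
- Features were then grouped by dtype into categorical columns (the five text columns) and
  numerical columns (the three exam scores and `parent_edu_score`) so that each group can
  receive appropriate preprocessing, such as encoding for the former and scaling for the latter.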
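
Because `average_score` is computed directly from the three exam scores, leaving them in `X` would let a model reconstruct the target exactly. The sketch below shows one possible way to handle this and to apply per-group preprocessing on a four-row toy frame. It drops the component scores from the features, one-hot encodes the categorical columns with `pd.get_dummies`, and standardizes the numerical columns to zero mean and unit variance. Dropping the scores, the choice of encoder and scaler, and the names `toy`, `X_cat`, `X_num`, and `X_prepared` are assumptions for illustration, not steps taken in the notebook.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns; values are illustrative only
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "lunch": ["standard", "free/reduced", "standard", "standard"],
    "math score": [72, 47, 90, 76],
    "reading score": [72, 57, 95, 78],
    "writing score": [74, 44, 93, 75],
    "parent_edu_score": [5, 4, 6, 3],
})
score_cols = ["math score", "reading score", "writing score"]
toy["average_score"] = toy[score_cols].mean(axis=1)

# Assumption: also drop the component scores, since the target is derived from them
X = toy.drop(columns=["average_score"] + score_cols)
y = toy["average_score"]

# Select by "number" so the split does not depend on how pandas stores text columns
numerical_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# One-hot encode the categorical columns (one 0/1 column per category)
X_cat = pd.get_dummies(X[categorical_cols], columns=list(categorical_cols), dtype=int)

# Standardize the numerical columns using the population standard deviation
num = X[numerical_cols]
X_num = (num - num.mean()) / num.std(ddof=0)

X_prepared = pd.concat([X_num, X_cat], axis=1)
print(X_prepared)
```

In a real modeling workflow the scaling statistics would normally be computed on the training split only and then reused on the test split.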
