# Student Performance Prediction (Classification)

**📊 Dataset:** `student_performance.csv`  
**📚 Source:** [Kaggle – student_performance Dataset](https://www.kaggle.com/)  




## 🎯 Goal
The goal of this project is to predict student academic performance using **supervised machine learning (classification) with Apache Spark**.  
By analyzing students' demographic, behavioral, and academic features, the model classifies students by expected performance level. This helps identify students who may need early academic support and informs educational decision-making.




## 📈 Description
The dataset includes features such as:  
- `StudentID`, `Gender`, `AttendanceRate`, `StudyHoursPerWeek`  
- `PreviousGrade`, `ExtracurricularActivities`, `ParentalSupport`, `Online Classes Taken`  
- `FinalGrade` (target variable; binned into performance levels for classification)  

This project predicts students' academic performance levels to support **data-driven educational decisions**.




## 📝 Notebook Scope
This notebook focuses on:  
- Understanding the dataset  
- Exploratory Data Analysis (EDA)  
- Data cleaning using Apache Spark  
- Preparing clean data for machine learning




In [1]:
# import libraries
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.sql.types import IntegerType, DoubleType
import pyspark.sql.functions as F

In [2]:
# Create Spark Session
spark = SparkSession.builder \
    .appName("Student Performance - Data Cleaning") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.driver.host", "127.0.0.1") \
    .getOrCreate()

print("✅ Spark Session Created Successfully")

✅ Spark Session Created Successfully


# Phase 1: Data Overview & Understanding

In [3]:
# Load raw dataset
df = spark.read.csv(
    r"C:\Users\Msi\OneDrive\Documents\BIg Data\project student-performance-prediction\data\raw\student_performance_updated_1000.csv", 
    header=True,
    inferSchema=True
)

print("✅ Dataset Loaded Successfully")

✅ Dataset Loaded Successfully


In [4]:
# Dataset Shape
rows = df.count()
cols = len(df.columns)

print(f"📊 Dataset Shape: {rows} rows, {cols} columns")

📊 Dataset Shape: 1000 rows, 12 columns


In [5]:
# Preview Dataset
print("🔹 First 5 rows:")
df.show(5)

🔹 First 5 rows:
+---------+-------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|   Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+-------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|      1.0|   John|  Male|          85.0|             15.0|         78.0|                      1.0|           High|      80.0|        4.8|          59.0|               false|
|      2.0|  Sarah|Female|          90.0|             20.0|         85.0|                      2.0|         Medium|      87.0|        2.2|          70.0|                true|
|      3.0|   Alex|  Male|          78.0|             10.0|         65.0|                      0.0|       

In [6]:
# Preview a random sample (about 1% of rows; the rows shown vary between runs)
print("🔹 Random sample:")
df.sample(fraction=0.01).show(5)

🔹 Random sample:
+---------+--------------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|          Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+--------------+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|   1949.0|Jessica Ortega|  Male|          91.0|             20.0|         null|                      1.0|           High|      72.0|        3.8|          58.0|               false|
|   9214.0| Judith Santos|  Male|          85.0|             17.0|         85.0|                     null|            Low|      87.0|        2.5|          61.0|                true|
|   6517.0|  Melvin Mcgee|  Male|          91.0|              8.0|    

In [7]:
# Dataset Schema & Info
print("🔹 Dataset Schema:")
df.printSchema()

🔹 Dataset Schema:
root
 |-- StudentID: double (nullable = true)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = true)
 |-- AttendanceRate: double (nullable = true)
 |-- StudyHoursPerWeek: double (nullable = true)
 |-- PreviousGrade: double (nullable = true)
 |-- ExtracurricularActivities: double (nullable = true)
 |-- ParentalSupport: string (nullable = true)
 |-- FinalGrade: double (nullable = true)
 |-- Study Hours: double (nullable = true)
 |-- Attendance (%): double (nullable = true)
 |-- Online Classes Taken: boolean (nullable = true)



#### Identify Column Types

In [None]:
# Identify numerical columns (integer or double types)
numerical_cols = [
    field.name for field in df.schema.fields
    if isinstance(field.dataType, (IntegerType, DoubleType))
]

print("📌 Numerical Columns:", numerical_cols)

📌 Numerical Columns: ['StudentID', 'AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade', 'ExtracurricularActivities', 'FinalGrade', 'Study Hours', 'Attendance (%)']


In [9]:
# Identify Categorical Columns
categorical_cols = [
    field.name for field in df.schema.fields
    if field.name not in numerical_cols
]
print("📌 Categorical Columns:", categorical_cols)

📌 Categorical Columns: ['Name', 'Gender', 'ParentalSupport', 'Online Classes Taken']


In [10]:
# Unique values per column
# Note: distinct() treats null as its own value, so columns with nulls count one extra.
print("🔹 Unique values per column:")
for col_name in df.columns:
    print(f"{col_name}: {df.select(col_name).distinct().count()}")

🔹 Unique values per column:
StudentID: 917
Name: 963
Gender: 3
AttendanceRate: 10
StudyHoursPerWeek: 11
PreviousGrade: 11
ExtracurricularActivities: 5
ParentalSupport: 4
FinalGrade: 11
Study Hours: 53
Attendance (%): 53
Online Classes Taken: 3


# Data Cleaning

In [11]:
# Check Missing Values
print("🔹 Missing values per column:")
df.select([
    F.count(F.when(col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()

🔹 Missing values per column:
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|       40|  34|    48|            40|               50|           33|                       43|             22|        40|         24|            41|                  25|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
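
To put the counts above in proportion, the sketch below converts them to percentages of the 1000 rows reported earlier. It is plain Python rather than Spark, and the names `missing_counts` and `missing_rates` are illustrative, not part of the notebook.

```python
# Missing-value counts copied from the output above; total rows from the dataset shape.
total_rows = 1000
missing_counts = {
    "StudentID": 40, "Name": 34, "Gender": 48, "AttendanceRate": 40,
    "StudyHoursPerWeek": 50, "PreviousGrade": 33,
    "ExtracurricularActivities": 43, "ParentalSupport": 22,
    "FinalGrade": 40, "Study Hours": 24, "Attendance (%)": 41,
    "Online Classes Taken": 25,
}

# Percentage of rows missing per column, highest first
missing_rates = {
    name: 100 * count / total_rows
    for name, count in sorted(missing_counts.items(), key=lambda kv: -kv[1])
}

for name, rate in missing_rates.items():
    print(f"{name}: {rate:.1f}%")
```

Every column is missing between 2.2% and 5.0% of its values, so imputation touches only a small share of each column.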



#### Handle Missing Values
Strategy:
- Numerical → Median (robust to skewed values and outliers)
- Categorical → Mode (most frequent value)
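
As a small illustration of why the median is the safer fill value for numeric columns, the sketch below compares mean and median imputation on a made-up list with one extreme entry. The data and the helper `impute` are hypothetical, and the example uses plain Python instead of Spark.

```python
from statistics import mean, median

def impute(values, fill):
    """Replace None entries with the given fill value."""
    return [fill if v is None else v for v in values]

# Hypothetical weekly study hours; None marks a missing entry, 90 is an outlier
hours = [10, 12, None, 15, 8, 90, None]
observed = [v for v in hours if v is not None]

mean_fill = mean(observed)      # pulled upward by the outlier
median_fill = median(observed)  # stays near the typical value

print("mean fill:", mean_fill)      # 27
print("median fill:", median_fill)  # 12
print(impute(hours, median_fill))   # [10, 12, 12, 15, 8, 90, 12]
```

A single extreme value more than doubles the mean-based fill, while the median-based fill stays within the typical range.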

In [12]:
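# Handle numerical missing values (median imputation)
# Note: numerical_cols also contains StudentID (an identifier) and FinalGrade
# (the basis of the target), so both are median-filled here as well.
medians = df.approxQuantile(numerical_cols, [0.5], 0.01)  # one list per column, 1% relative error
fill_values = {c: q[0] for c, q in zip(numerical_cols, medians) if q}
df = df.fillna(fill_values)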


In [13]:
# Handle Categorical Missing Values (Mode)

# Select string-typed columns. The boolean column "Online Classes Taken" is not
# a string column, so its nulls are left untouched by this cell.
categorical_cols = [name for name, dtype in df.dtypes if dtype == 'string']

for col_name in categorical_cols:
    # Get the most frequent value (mode) for the column.
    # Nulls are grouped too, so if null is the most frequent entry
    # (as happens for Name), mode_row[0] is None and the column is skipped.
    mode_row = df.groupBy(col_name).count().orderBy(F.desc("count")).first()

    # If a non-null mode exists, fill missing values with it
    if mode_row is not None and mode_row[0] is not None:
        mode_value = str(mode_row[0])  # Ensure it's a string
        df = df.na.fill({col_name: mode_value})
    else:
        print(f"⚠️ Column {col_name} is empty or all null, skipping fillna.")

print("✅ Missing values for categorical columns handled successfully")

⚠️ Column Name is empty or all null, skipping fillna.
✅ Missing values for categorical columns handled successfully


In [14]:
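# Validate missing-value handling
# Name and "Online Classes Taken" are expected to keep their nulls:
# neither column was filled in the previous cells.
print("🔹 Missing values after cleaning:")
df.select([
    F.count(F.when(col(c).isNull(), c)).alias(c)
    for c in df.columns
]).show()

🔹 Missing values after cleaning: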
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|StudentID|Name|Gender|AttendanceRate|StudyHoursPerWeek|PreviousGrade|ExtracurricularActivities|ParentalSupport|FinalGrade|Study Hours|Attendance (%)|Online Classes Taken|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
|        0|  34|     0|             0|                0|            0|                        0|              0|         0|          0|             0|                  25|
+---------+----+------+--------------+-----------------+-------------+-------------------------+---------------+----------+-----------+--------------+--------------------+
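
The remaining 34 nulls in `Name` follow from how the mode was looked up: nulls are counted as a group of their own, and for a nearly unique column that group is the largest. A minimal plain-Python sketch of a mode that ignores missing entries is shown below; `mode_ignoring_nulls` and the sample data are hypothetical. Whether a name column should be imputed at all is a separate question, since a most-frequent name carries no information.

```python
from collections import Counter

def mode_ignoring_nulls(values):
    """Return the most frequent non-null value, or None if every entry is null."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical column where null is the single most frequent entry
support = ["High", None, "Low", None, None, "High", "Medium"]

naive_mode = Counter(support).most_common(1)[0][0]  # counts None as a value
clean_mode = mode_ignoring_nulls(support)

print(naive_mode)  # None
print(clean_mode)  # High
```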



In [15]:
# Check Duplicates
duplicates_count = df.count() - df.dropDuplicates().count()
print(f"🔹 Number of duplicate rows: {duplicates_count}")

# Remove Duplicates
df = df.dropDuplicates()
print("✅ Duplicate rows removed")

print("📊 New Dataset Shape:", df.count(), "rows")

🔹 Number of duplicate rows: 0
✅ Duplicate rows removed
📊 New Dataset Shape: 1000 rows

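
`dropDuplicates()` with no arguments only removes rows that match on every column. Since the unique-value count earlier shows fewer distinct `StudentID` values than rows, it may also be worth checking for repeated keys. The plain-Python sketch below shows the idea on hypothetical records; `repeated_keys` is an illustrative helper, not part of the notebook.

```python
from collections import Counter

def repeated_keys(rows, key):
    """Return {key value: occurrences} for key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

# Hypothetical records: no two rows are fully identical, but ID 2 appears twice
rows = [
    {"StudentID": 1, "FinalGrade": 80},
    {"StudentID": 2, "FinalGrade": 87},
    {"StudentID": 2, "FinalGrade": 91},
    {"StudentID": 3, "FinalGrade": 68},
]

full_row_duplicates = len(rows) - len({tuple(sorted(r.items())) for r in rows})
print(full_row_duplicates)               # 0
print(repeated_keys(rows, "StudentID"))  # {2: 2}
```

In Spark, the analogous check would group by the key column and keep the groups whose count is above one.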

In [16]:
# Cap outliers using the IQR rule (numerical columns)
# Note: numerical_cols still includes StudentID and FinalGrade, so both are capped too.
for col_name in numerical_cols:
    # Approximate quartiles in a single call (1% relative error)
    Q1, Q3 = df.approxQuantile(col_name, [0.25, 0.75], 0.01)
    IQR = Q3 - Q1

    lower = Q1 - 1.5 * IQR
    upper = Q3 + 1.5 * IQR

    # Clip values outside [lower, upper] to the nearest bound
    df = df.withColumn(
        col_name,
        when(col(col_name) < lower, lower)
        .when(col(col_name) > upper, upper)
        .otherwise(col(col_name))
    )

print("✅ Outliers handled using IQR capping")

✅ Outliers handled using IQR capping
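
To make the capping rule concrete, the sketch below applies the same fences, `Q1 - 1.5 * IQR` and `Q3 + 1.5 * IQR`, to a small made-up list in plain Python. The sample values are hypothetical, and `statistics.quantiles` interpolates between data points, so its quartiles can differ slightly from those returned by Spark's `approxQuantile`.

```python
from statistics import quantiles

# Hypothetical grades with one implausible entry
values = [60, 70, 75, 80, 85, 90, 95, 300]

q1, _, q3 = quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Clip each value into [lower, upper]
capped = [min(max(v, lower), upper) for v in values]

print(q1, q3, iqr)   # 71.25 93.75 22.5
print(lower, upper)  # 37.5 127.5
print(capped)        # [60, 70, 75, 80, 85, 90, 95, 127.5]
```

Only the entry of 300 lies outside the fences, so it is the only value replaced; everything inside the range is left unchanged.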


In [17]:
# Create target variable for classification
df = df.withColumn(
    "PerformanceLevel",
    when(col("FinalGrade") >= 85, "High")
    .when(col("FinalGrade") >= 70, "Medium")
    .otherwise("Low")
)

print("✅ Target variable (PerformanceLevel) created")

# NOTE:
# PerformanceLevel will be used as the target variable
# in the ML modeling phase (model.py)


✅ Target variable (PerformanceLevel) created
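
The thresholds above can be mirrored in a plain Python function to check the boundary cases. `performance_level` is a hypothetical helper for illustration; the notebook itself applies the rule with Spark's `when`/`otherwise`.

```python
def performance_level(final_grade):
    """Map a numeric final grade to High / Medium / Low using the cell's cut-offs."""
    if final_grade >= 85:
        return "High"
    if final_grade >= 70:
        return "Medium"
    return "Low"

# Check values on and just below each boundary
for grade in (92, 85, 84.9, 70, 69.9):
    print(grade, "->", performance_level(grade))
```

Both cut-offs are inclusive, so a grade of exactly 85 is `High` and exactly 70 is `Medium`. In the Spark version a null grade would fall through to `Low`, which is harmless here only because `FinalGrade` nulls were filled earlier.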


In [18]:
#  Final Cleaned Dataset Validation
print("📊 Final Dataset Shape:")
print("Rows:", df.count())
print("Columns:", len(df.columns))

print("🔹 Final Schema:")
df.printSchema()

df.describe().show()

📊 Final Dataset Shape:
Rows: 1000
Columns: 13
🔹 Final Schema:
root
 |-- StudentID: double (nullable = false)
 |-- Name: string (nullable = true)
 |-- Gender: string (nullable = false)
 |-- AttendanceRate: double (nullable = false)
 |-- StudyHoursPerWeek: double (nullable = false)
 |-- PreviousGrade: double (nullable = false)
 |-- ExtracurricularActivities: double (nullable = false)
 |-- ParentalSupport: string (nullable = false)
 |-- FinalGrade: double (nullable = false)
 |-- Study Hours: double (nullable = false)
 |-- Attendance (%): double (nullable = false)
 |-- Online Classes Taken: boolean (nullable = true)
 |-- PerformanceLevel: string (nullable = false)

+-------+------------------+--------------+------+-----------------+-----------------+-----------------+-------------------------+---------------+-----------------+------------------+------------------+----------------+
|summary|         StudentID|          Name|Gender|   AttendanceRate|StudyHoursPerWeek|    PreviousGrade|

### Save cleaned dataset 


In [19]:
# Save cleaned dataset in cleaned folder
df.toPandas().to_csv(
    r"C:\Users\Msi\OneDrive\Documents\BIg Data\project student-performance-prediction\data\cleaned\student_performance_cleaned.csv",
    index=False
)

print("✅ Cleaned dataset saved successfully")


✅ Cleaned dataset saved successfully
