# STUDENT PERFORMANCE PREDICTOR

This project aims to predict student performance (pass/fail) based on various academic and environmental factors. By leveraging machine learning, it identifies key indicators that influence a student's success.

**Project Overview**

Objective: To build a predictive model that can forecast whether a student will pass or fail, providing insights into influential factors.

Dataset: The project uses the `mabubakrsiddiq/student-exam-performance` dataset, downloaded through KaggleHub (1,000 rows, 18 columns). It includes student IDs, final and previous scores in different subjects, study habits, attendance, and environmental factors.

Methodology:

- Data Preparation: The initial raw data is cleaned by removing irrelevant columns and engineering new features like 'average_prev_score'.
- Feature Encoding: Categorical variables such as parent education level and study environment are transformed into numerical formats using Label Encoding.
- Model Training: A Logistic Regression model is trained on the processed data to learn the patterns associated with student performance.
- Evaluation: The model's effectiveness is assessed using accuracy and a classification report to understand its ability to correctly predict pass/fail outcomes.

Outcome: The trained model predicts, for an individual student, whether that student will pass or fail, which can be valuable for early intervention and support.

In [None]:
# Install and import necessary libraries
!pip install kagglehub
import kagglehub
import pandas as pd
import numpy as np
import os



In [None]:
# Download the dataset from KaggleHub
path = kagglehub.dataset_download("mabubakrsiddiq/student-exam-performance")
print("Dataset downloaded to:", path)

Dataset downloaded to: /root/.cache/kagglehub/datasets/mabubakrsiddiq/student-exam-performance/versions/1


In [None]:
# Load the dataset into a pandas DataFrame
files = os.listdir(path)
print("Files in dataset:", files)

file_path = os.path.join(path, files[0])
data = pd.read_csv(file_path)

# Display the shape and the first 5 rows of the dataset
print("Dataset shape:", data.shape)
data.head()

Files in dataset: ['student_performance_interactions.csv']
Dataset shape: (1000, 18)


Unnamed: 0,student_id,final_score,grade,pass_fail,previous_score,math_prev_score,science_prev_score,language_prev_score,daily_study_hours,attendance_percentage,homework_completion_rate,sleep_hours,screen_time_hours,physical_activity_minutes,motivation_score,exam_anxiety_score,parent_education_level,study_environment
0,S0001,60.137241,D,1,60.599707,61.488212,53.568119,64.972292,1.427203,75.738405,68.534371,6.809352,3.313096,65.059425,4.150025,6.104103,Master,Noisy
1,S0002,99.021977,A,1,92.289287,85.612565,91.873759,89.040461,4.813612,89.602736,91.990197,5.567793,4.925359,76.016617,8.714693,1.982358,High School,Quiet
2,S0003,70.522955,C,1,80.259667,82.160656,72.736065,74.243663,1.240908,81.495426,69.669666,6.702875,5.107888,113.616872,5.92822,4.463662,High School,Moderate
3,S0004,63.448537,D,1,72.926217,75.979145,76.726496,67.715995,2.190601,71.472047,71.976757,7.854439,3.772446,108.68669,4.224928,4.740474,High School,Noisy
4,S0005,66.483019,C,1,48.581025,51.379977,48.993224,46.145011,2.192265,64.276582,68.940591,7.662429,1.898989,42.107294,9.506815,1.143852,Master,Quiet
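Selecting `files[0]` works here because the download contains a single file, but `os.listdir` returns entries in arbitrary order, so a dataset version with extra files could load the wrong one. A minimal sketch of a more defensive selection follows. `find_first_csv` is a hypothetical helper written for illustration, not part of kagglehub, and the demo directory contents are made up apart from the CSV name shown above.

```python
import os
import tempfile


def find_first_csv(directory):
    """Return the path of the alphabetically first .csv file in `directory`."""
    csv_files = sorted(
        name for name in os.listdir(directory) if name.lower().endswith(".csv")
    )
    if not csv_files:
        raise FileNotFoundError(f"No CSV file found in {directory}")
    return os.path.join(directory, csv_files[0])


# Demonstration on a throwaway directory holding one CSV and one unrelated file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("README.txt", "student_performance_interactions.csv"):
        open(os.path.join(tmp, name), "w").close()
    chosen = os.path.basename(find_first_csv(tmp))

print("Selected file:", chosen)
```

In the notebook, the result of `find_first_csv(path)` could be passed to `pd.read_csv` in place of `os.path.join(path, files[0])`.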


In [None]:
# Remove 'motivation_score' and 'exam_anxiety_score' columns
data = data.drop(["motivation_score", "exam_anxiety_score"], axis=1)
print("Columns after removal:")
print(data.columns)

Columns after removal:
Index(['student_id', 'final_score', 'grade', 'pass_fail', 'previous_score',
       'math_prev_score', 'science_prev_score', 'language_prev_score',
       'daily_study_hours', 'attendance_percentage',
       'homework_completion_rate', 'sleep_hours', 'screen_time_hours',
       'physical_activity_minutes', 'parent_education_level',
       'study_environment'],
      dtype='object')


In [None]:
# Feature Engineering: Calculate 'average_prev_score'
data["average_prev_score"] = (
    data["math_prev_score"] +
    data["science_prev_score"] +
    data["language_prev_score"]
) / 3

# Create the binary target 'pass': 1 if average_prev_score >= 50, else 0.
# This label is derived from previous scores and is separate from the dataset's own 'pass_fail' column.
data["pass"] = (data["average_prev_score"] >= 50).astype(int)

# Display the first few rows with new features
data.head()

Unnamed: 0,student_id,final_score,grade,pass_fail,previous_score,math_prev_score,science_prev_score,language_prev_score,daily_study_hours,attendance_percentage,homework_completion_rate,sleep_hours,screen_time_hours,physical_activity_minutes,parent_education_level,study_environment,average_prev_score,pass
0,S0001,60.137241,D,1,60.599707,61.488212,53.568119,64.972292,1.427203,75.738405,68.534371,6.809352,3.313096,65.059425,Master,Noisy,60.009541,1
1,S0002,99.021977,A,1,92.289287,85.612565,91.873759,89.040461,4.813612,89.602736,91.990197,5.567793,4.925359,76.016617,High School,Quiet,88.842262,1
2,S0003,70.522955,C,1,80.259667,82.160656,72.736065,74.243663,1.240908,81.495426,69.669666,6.702875,5.107888,113.616872,High School,Moderate,76.380128,1
3,S0004,63.448537,D,1,72.926217,75.979145,76.726496,67.715995,2.190601,71.472047,71.976757,7.854439,3.772446,108.68669,High School,Noisy,73.473879,1
4,S0005,66.483019,C,1,48.581025,51.379977,48.993224,46.145011,2.192265,64.276582,68.940591,7.662429,1.898989,42.107294,Master,Quiet,48.839404,0
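The feature engineering step can be checked by hand. The sketch below recomputes `average_prev_score` and `pass` for two rows of the table above using plain Python; the scores are copied from the output, and the names `students`, `results`, and `PASS_THRESHOLD` are introduced only for this illustration. Note that S0005 has `pass_fail` equal to 1 in the dataset but receives a derived `pass` of 0, so the two labels are not interchangeable.

```python
# Previous subject scores (math, science, language) copied from the table above.
students = {
    "S0001": (61.488212, 53.568119, 64.972292),
    "S0005": (51.379977, 48.993224, 46.145011),
}
PASS_THRESHOLD = 50  # same cutoff the notebook applies to average_prev_score

results = {}
for student_id, scores in students.items():
    average = sum(scores) / len(scores)
    # Store the rounded average and the derived binary label.
    results[student_id] = (round(average, 6), int(average >= PASS_THRESHOLD))

for student_id, (average, label) in results.items():
    print(student_id, average, label)
```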


In [None]:
# Preprocessing: Encode categorical features using LabelEncoder
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

# Encode every object-typed column as integers. Besides 'parent_education_level'
# and 'study_environment', this also converts 'student_id' and 'grade'.
for col in data.columns:
    if data[col].dtype == "object":
        data[col] = le.fit_transform(data[col])
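Because a single `LabelEncoder` is refit on each column, only the mapping for the last column survives the loop. A hedged sketch of one alternative is shown below: it keeps one encoder per column so that integer codes can be translated back to the original labels. The `columns`, `encoders`, and `encoded` names are illustrative, and the values are the five rows shown earlier rather than the full dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Categorical values from the first five rows of the dataset preview.
columns = {
    "parent_education_level": ["Master", "High School", "High School", "High School", "Master"],
    "study_environment": ["Noisy", "Quiet", "Moderate", "Noisy", "Quiet"],
}

encoders = {}  # one fitted encoder per column, kept for later decoding
encoded = {}
for name, values in columns.items():
    encoder = LabelEncoder()
    encoded[name] = encoder.fit_transform(values).tolist()
    encoders[name] = encoder

# LabelEncoder assigns codes in sorted order of the distinct labels.
print(list(encoders["study_environment"].classes_))
print(encoded["study_environment"])

# Codes can be mapped back to labels with the stored encoder.
decoded = encoders["study_environment"].inverse_transform([1, 2, 0]).tolist()
print(decoded)
```

Label encoding imposes an arbitrary ordering on categories, which a linear model will treat as numeric magnitude; one-hot encoding is a common alternative when that ordering is not meaningful.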

In [None]:
# Define features (X) and target (y)
# X contains all columns except 'pass' and 'average_prev_score'
X = data.drop(["pass", "average_prev_score"], axis=1)
# y is the target variable 'pass'
y = data["pass"]

# Print the shapes of X and y
print(X.shape, y.shape)

(1000, 16) (1000,)
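The feature matrix above still contains the three subject scores that define `pass`, plus `final_score`, `grade`, and `pass_fail`, which describe the final outcome, and the encoded `student_id`. The sketch below lists which columns would remain if those were excluded. The grouping into `outcome_or_id` and `target_inputs` is an assumption made for illustration, not the notebook's method, and `previous_score` is kept even though it is likely correlated with the subject scores.

```python
# Columns of X in the notebook: everything except 'pass' and 'average_prev_score'.
x_columns = [
    "student_id", "final_score", "grade", "pass_fail", "previous_score",
    "math_prev_score", "science_prev_score", "language_prev_score",
    "daily_study_hours", "attendance_percentage", "homework_completion_rate",
    "sleep_hours", "screen_time_hours", "physical_activity_minutes",
    "parent_education_level", "study_environment",
]

# Assumed groupings: row identifier and outcome columns, and the target's own inputs.
outcome_or_id = {"student_id", "final_score", "grade", "pass_fail"}
target_inputs = {"math_prev_score", "science_prev_score", "language_prev_score"}
excluded = outcome_or_id | target_inputs

candidate_features = [col for col in x_columns if col not in excluded]
print(len(candidate_features), "candidate features:")
print(candidate_features)
```

In the notebook this list could be used as `X = data[candidate_features]`; the reported accuracy would be expected to change, since the model would no longer see the columns the label is computed from.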


In [None]:
# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

# 80% of rows for training, 20% for testing; fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
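When the classes are imbalanced, a plain random split can give the test set a different pass/fail ratio than the full data. A minimal sketch of a stratified split is shown below, using the `stratify` parameter of `train_test_split`. The arrays `X_demo` and `y_demo` are synthetic stand-ins with made-up class proportions, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 780 positives and 220 negatives (made-up proportions).
y_demo = np.array([1] * 780 + [0] * 220)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# With stratification the positive rate in each part tracks the overall rate.
print("overall positive rate:", y_demo.mean())
print("test positive rate:   ", y_te.mean())
print("train positive rate:  ", y_tr.mean())
```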

In [None]:
# Model Training: Initialize and train a Logistic Regression model
from sklearn.linear_model import LogisticRegression

# max_iter is raised from scikit-learn's default of 100; on these unscaled features
# the solver can still stop at the limit and emit a ConvergenceWarning.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)  # Train the model on the training data

print("Model trained")

Model trained


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
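The warning above suggests scaling the data. One way to do that, sketched below under stated assumptions, is to chain `StandardScaler` and `LogisticRegression` in a scikit-learn pipeline so the scaler is fit on the training data only. The data here is synthetic (three columns on different scales, with a label driven by the first column), so the numbers it prints say nothing about the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic features on deliberately different scales (illustrative only).
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.uniform(30, 100, n),   # score-like column
    rng.uniform(0, 6, n),      # hours-like column
    rng.uniform(0, 120, n),    # minutes-like column
])
y_demo = (X_demo[:, 0] >= 50).astype(int)

# Standardize each feature, then fit the classifier on the scaled values.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_demo, y_demo)

iterations = int(scaled_model[-1].n_iter_[0])  # iterations the solver actually used
train_accuracy = scaled_model.score(X_demo, y_demo)
print("iterations used:", iterations)
print("training accuracy:", round(train_accuracy, 3))
```

In the notebook, the same pipeline object could be fit with `X_train` and `y_train` and then used wherever `model` is used, since a pipeline exposes `predict` as well.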


In [None]:
# Model Evaluation: Make predictions and evaluate performance
from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test) # Make predictions on the test set

# Print evaluation metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n")
print(classification_report(y_test, y_pred))

Accuracy: 0.965

Classification Report:

              precision    recall  f1-score   support

           0       0.93      0.91      0.92        43
           1       0.97      0.98      0.98       157

    accuracy                           0.96       200
   macro avg       0.95      0.94      0.95       200
weighted avg       0.96      0.96      0.96       200
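The figures in the report can be reproduced from four confusion counts. The counts below are inferred from the rounded report and the 0.965 accuracy (193 of 200 correct); the notebook itself does not print a confusion matrix, so treat them as a reconstruction. The helper `precision_recall_f1` is written for this illustration.

```python
# Confusion counts inferred from the report above (not printed by the notebook).
tn, fp = 39, 4     # actual class 0: predicted 0, predicted 1
fn, tp = 3, 154    # actual class 1: predicted 0, predicted 1


def precision_recall_f1(correct, predicted_total, actual_total):
    """Return (precision, recall, f1) for one class, rounded to two decimals."""
    precision = correct / predicted_total
    recall = correct / actual_total
    f1 = 2 * precision * recall / (precision + recall)
    return round(precision, 2), round(recall, 2), round(f1, 2)


accuracy = (tn + tp) / (tn + fp + fn + tp)
class0 = precision_recall_f1(tn, tn + fn, tn + fp)
class1 = precision_recall_f1(tp, tp + fp, tp + fn)

print("accuracy:", accuracy)
print("class 0 (precision, recall, f1):", class0)
print("class 1 (precision, recall, f1):", class1)
```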



In [None]:
# Individual Prediction: Select a random student for prediction
import random

index = random.randint(0, len(X) - 1) # Get a random index from the dataset
student_data = X.iloc[index] # Select the student's data using the random index

print("Selected student index:", index)
print("\nOriginal student data (from dataset):")
print(student_data)

Selected student index: 208

Original student data (from dataset):
student_id                   208.000000
final_score                   42.668157
grade                          4.000000
pass_fail                      0.000000
previous_score                40.278387
math_prev_score               39.173566
science_prev_score            37.070979
language_prev_score           34.627368
daily_study_hours              4.262982
attendance_percentage         61.679243
homework_completion_rate      81.028237
sleep_hours                    7.090045
screen_time_hours              5.015228
physical_activity_minutes    115.917376
parent_education_level         1.000000
study_environment              1.000000
Name: 208, dtype: float64


In [None]:
# Reshape the selected student's data for model input
# The model expects a 2D array (1 sample, n_features)
student_input = student_data.values.reshape(1, -1)

print("\nReshaped input passed to model:")
print(student_input)


Reshaped input passed to model:
[[208.          42.6681567    4.           0.          40.27838711
   39.17356624  37.07097914  34.62736849   4.26298152  61.67924307
   81.02823723   7.09004494   5.01522762 115.9173762    1.
    1.        ]]


In [None]:
# Make a prediction for the selected student
prediction = model.predict(student_input)

print("\nPrediction result:")
if prediction[0] == 1:
    print("Student will PASS")
else:
    print("Student will FAIL")


Prediction result:
Student will FAIL
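The prediction above returns only a class label. If a likelihood is wanted, `predict_proba` returns the estimated probability of each class. The sketch below also selects the row with double brackets so it stays a one-row DataFrame with column names, instead of a reshaped NumPy array. The data, the 70% attendance rule used to label it, and the names `demo`, `clf`, and `row` are all made up for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with a made-up labeling rule.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "attendance_percentage": rng.uniform(50, 100, 200),
    "daily_study_hours": rng.uniform(0, 6, 200),
})
labels = (demo["attendance_percentage"] >= 70).astype(int)

clf = LogisticRegression(max_iter=1000).fit(demo, labels)

# A seeded generator makes the "random" student reproducible.
index = int(rng.integers(0, len(demo)))
row = demo.iloc[[index]]  # double brackets keep a one-row DataFrame

predicted = int(clf.predict(row)[0])
p_fail, p_pass = clf.predict_proba(row)[0]  # column order follows clf.classes_ == [0, 1]

print("Selected student index:", index)
print("Predicted class:", predicted)
print("Estimated probability of passing:", round(float(p_pass), 3))
```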


