# Student Performance Prediction

This project focuses on a dataset that contains various information on students, such as their study habits, attendance, parental involvement, and other factors influencing academic success. The purpose of this project is to develop a model that can accurately predict the final exam scores of these students.

The dataset is provided by a Kaggle user (https://www.kaggle.com/datasets/lainguyn123/student-performance-factors). We will use Python with scikit-learn preprocessing, pipelines and regression models to build our model.

## Import Libraries

In [187]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import numpy as np
from matplotlib import pyplot as plt 
import seaborn as sns
import pandas as pd
from sklearn.model_selection import GridSearchCV, train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score 
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, StackingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor

## Load and Inspect Data 

In [188]:
df = pd.read_csv('StudentPerformanceFactors.csv')
df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6607 entries, 0 to 6606
Data columns (total 20 columns):
 #   Column                      Non-Null Count  Dtype 
---  ------                      --------------  ----- 
 0   Hours_Studied               6607 non-null   int64 
 1   Attendance                  6607 non-null   int64 
 2   Parental_Involvement        6607 non-null   object
 3   Access_to_Resources         6607 non-null   object
 4   Extracurricular_Activities  6607 non-null   object
 5   Sleep_Hours                 6607 non-null   int64 
 6   Previous_Scores             6607 non-null   int64 
 7   Motivation_Level            6607 non-null   object
 8   Internet_Access             6607 non-null   object
 9   Tutoring_Sessions           6607 non-null   int64 
 10  Family_Income               6607 non-null   object
 11  Teacher_Quality             6529 non-null   object
 12  School_Type                 6607 non-null   object
 13  Peer_Influence              6607 non-null   obje

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
0,23,84,Low,High,No,7,73,Low,Yes,0,Low,Medium,Public,Positive,3,No,High School,Near,Male,67
1,19,64,Low,Medium,No,8,59,Low,Yes,2,Medium,Medium,Public,Negative,4,No,College,Moderate,Female,61
2,24,98,Medium,Medium,Yes,7,91,Medium,Yes,2,Medium,Medium,Public,Neutral,4,No,Postgraduate,Near,Male,74
3,29,89,Low,Medium,Yes,8,98,Medium,Yes,1,Medium,Medium,Public,Negative,4,No,High School,Moderate,Male,71
4,19,92,Medium,Medium,Yes,6,65,Medium,Yes,3,Medium,High,Public,Neutral,4,No,College,Near,Female,70


The dataset contains 6607 rows and 20 columns (19 features plus the target, 'Exam_Score') describing students. It seems that the 'Teacher_Quality', 'Distance_from_Home' and 'Parental_Education_Level' columns are the only columns that have null values.

Here's a summary of all the columns:

- **Hours_Studied**:	Number of hours spent studying per week.
- **Attendance**:	Percentage of classes attended.
- **Parental_Involvement**:	Level of parental involvement in the student's education (Low, Medium, High).
- **Access_to_Resources**:	Availability of educational resources (Low, Medium, High).
- **Extracurricular_Activities**:	Participation in extracurricular activities (Yes, No).
- **Sleep_Hours**:	Average number of hours of sleep per night.
- **Previous_Scores**:	Scores from previous exams.
- **Motivation_Level**:	Student's level of motivation (Low, Medium, High).
- **Internet_Access**:	Availability of internet access (Yes, No).
- **Tutoring_Sessions**:	Number of tutoring sessions attended per month.
- **Family_Income**:	Family income level (Low, Medium, High).
- **Teacher_Quality**:	Quality of the teachers (Low, Medium, High).
- **School_Type**:	Type of school attended (Public, Private).
- **Peer_Influence**:	Influence of peers on academic performance (Positive, Neutral, Negative).
- **Physical_Activity**:	Average number of hours of physical activity per week.
- **Learning_Disabilities**:	Presence of learning disabilities (Yes, No).
- **Parental_Education_Level**:	Highest education level of parents (High School, College, Postgraduate).
- **Distance_from_Home**:	Distance from home to school (Near, Moderate, Far).
- **Gender**:	Gender of the student (Male, Female).
- **Exam_Score**:	Final exam score.

## Data Cleaning and Preparation

### Dealing with null values 

We will use forward fill to replace the null values in the 'Teacher_Quality', 'Distance_from_Home' and 'Parental_Education_Level' columns. Forward fill copies the last non-null value above each gap into that gap.

In [189]:
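# Forward fill null values. Assigning the result back avoids calling
# inplace=True on a column selection, which newer pandas versions may not
# propagate to df.
null_cols = ['Teacher_Quality', 'Distance_from_Home', 'Parental_Education_Level']
df[null_cols] = df[null_cols].ffill()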
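
To make the behaviour of forward fill concrete, here is a minimal sketch on a toy column. The toy values are made up for illustration and are not rows from the dataset. The sketch also shows mode imputation, a common alternative for categorical columns that this project does not use.

```python
import numpy as np
import pandas as pd

# Toy column standing in for 'Teacher_Quality' (illustrative values only)
toy = pd.DataFrame({'Teacher_Quality': ['Medium', np.nan, 'High', np.nan, np.nan, 'Medium']})

# Forward fill: each gap takes the last non-null value above it
toy['ffilled'] = toy['Teacher_Quality'].ffill()

# Alternative (not used in this project): fill every gap with the most frequent value
mode_value = toy['Teacher_Quality'].mode()[0]
toy['mode_filled'] = toy['Teacher_Quality'].fillna(mode_value)

print(toy)
```

Forward fill depends on row order, so it is only a neutral choice when the rows are not sorted by anything related to the missing column.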

### Dealing with outliers

We found a student with an exam score of 101, which is impossible since the maximum score is 100, so we will replace this value with 100.

In [190]:
df[df.Exam_Score == 101]

Unnamed: 0,Hours_Studied,Attendance,Parental_Involvement,Access_to_Resources,Extracurricular_Activities,Sleep_Hours,Previous_Scores,Motivation_Level,Internet_Access,Tutoring_Sessions,Family_Income,Teacher_Quality,School_Type,Peer_Influence,Physical_Activity,Learning_Disabilities,Parental_Education_Level,Distance_from_Home,Gender,Exam_Score
1525,27,98,Low,Medium,Yes,6,93,Low,No,5,High,High,Public,Positive,3,No,High School,Moderate,Female,101


In [191]:
# Replace the impossible score of 101 with the maximum valid score
df['Exam_Score'] = df['Exam_Score'].replace(101, 100)
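
Replacing the single value 101 works for this dataset. A slightly more general sketch, assuming the valid maximum is 100 as stated above, counts every out-of-range score and caps it with `Series.clip`. The toy scores below are illustrative and `MAX_SCORE` is a name introduced here.

```python
import pandas as pd

MAX_SCORE = 100  # assumed upper bound for a valid exam score

# Toy scores with one impossible value (illustrative only)
scores = pd.Series([67, 61, 101, 74], name='Exam_Score')

# Count how many scores exceed the bound before changing anything
n_invalid = int((scores > MAX_SCORE).sum())

# Cap every score above the bound at the bound
capped = scores.clip(upper=MAX_SCORE)

print(n_invalid, capped.tolist())
```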

## Data Transformation

### Converting Binary Columns to Boolean

In [192]:
bin_cols = ['Extracurricular_Activities', 'Internet_Access', 'Learning_Disabilities', 'Gender']

# Map the Yes/No columns to True/False, assigning the result back to df
yes_no = {'Yes': True, 'No': False}
for col in ['Extracurricular_Activities', 'Internet_Access', 'Learning_Disabilities']:
    df[col] = df[col].map(yes_no)

# Gender is encoded as Male = True, Female = False
df['Gender'] = df['Gender'].map({'Male': True, 'Female': False})

df[bin_cols] = df[bin_cols].astype('bool')
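
One caveat with the cast to `bool`: any value the mapping does not cover (a different capitalisation, for instance) ends up as `True` after `astype('bool')`, because both non-empty strings and NaN are truthy. The helper below is a hypothetical sketch, not part of the original notebook, that fails loudly instead.

```python
import pandas as pd

def to_bool(series: pd.Series, true_value: str, false_value: str) -> pd.Series:
    """Map a two-valued text column to bool, raising on unexpected values."""
    mapped = series.map({true_value: True, false_value: False})
    if mapped.isna().any():
        # Collect the raw values that the mapping did not cover
        unexpected = sorted(series[mapped.isna()].astype(str).unique())
        raise ValueError(f'Unexpected values in {series.name}: {unexpected}')
    return mapped.astype(bool)

# Toy column (illustrative values only)
toy = pd.Series(['Yes', 'No', 'Yes'], name='Internet_Access')
flags = to_bool(toy, 'Yes', 'No')
print(flags.tolist())
```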

### Ordinal Encoding

In [193]:
# Define orders for ordinal encoding
education_order = ['High School', 'College', 'Postgraduate']
quality_order = ['Low', 'Medium', 'High']
distance_order = ['Near', 'Moderate', 'Far']

ordinal_enc = OrdinalEncoder(categories=[quality_order, quality_order, quality_order, quality_order, quality_order, education_order, 
                                        distance_order])
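
As a quick illustration of what the `categories` argument does, the sketch below encodes two toy columns. Each list gives the order for one column, and a value is replaced by its position in that list. The toy rows are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

quality_order = ['Low', 'Medium', 'High']
distance_order = ['Near', 'Moderate', 'Far']

# Toy frame with one quality column and one distance column (illustrative values)
toy = pd.DataFrame({
    'Teacher_Quality': ['High', 'Low', 'Medium'],
    'Distance_from_Home': ['Near', 'Far', 'Moderate'],
})

# One category list per column, in the same order as the columns
enc = OrdinalEncoder(categories=[quality_order, distance_order])
encoded = enc.fit_transform(toy)

print(encoded)
```

The order of the lists has to match the order of the columns passed to the encoder, which is why `ord_cols` and the `categories` list in this notebook need to stay aligned.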

## Feature Importances

In [194]:
features = df.drop(columns = ['Exam_Score']).columns
target = ['Exam_Score']

X = df[features]
y = df[target]
# Group columns by the preprocessing they need
num_cols = ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 'Tutoring_Sessions', 'Physical_Activity']
cat_cols = ['School_Type', 'Peer_Influence']
ord_cols = ['Parental_Involvement', 'Access_to_Resources', 'Motivation_Level', 'Family_Income', 'Teacher_Quality', 'Parental_Education_Level',
             'Distance_from_Home']
bin_cols = ['Extracurricular_Activities', 'Internet_Access', 'Learning_Disabilities', 'Gender']

preprocessor = ColumnTransformer(
    transformers = [
        ('num_vals', StandardScaler(), num_cols),
        # sparse_output replaces the older 'sparse' argument (scikit-learn >= 1.2)
        ('cat_vals', OneHotEncoder(sparse_output = False, drop = 'first'), cat_cols),
        ('bin_vals', 'passthrough', bin_cols),
        ('ord_vals', ordinal_enc, ord_cols)
    ]
)

In [195]:
# Apply the transformations to the full dataset (the split happens afterwards)
X_preprocessed = preprocessor.fit_transform(X)
X_preprocessed = pd.DataFrame(X_preprocessed, columns=preprocessor.get_feature_names_out())

# Split the data into train and test sets
x_train_processed, x_test_processed, y_train_processed, y_test_processed = train_test_split(X_preprocessed, y, test_size=0.2, random_state=0)
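
The cell above fits the preprocessor on all rows before splitting, so the scaler's mean and standard deviation include the test rows. This matters little for a tree-based importance ranking, but a leakage-free ordering is easy to sketch. The example below uses synthetic stand-in data and only a `StandardScaler`; the variable names are introduced here for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    'Hours_Studied': rng.integers(1, 40, size=100),
    'Attendance': rng.integers(60, 100, size=100),
})
toy_y = pd.Series(rng.integers(55, 100, size=100), name='Exam_Score')

# Split first, so the test rows never influence the fitted statistics
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=0)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # statistics come from training rows only
X_te_scaled = scaler.transform(X_te)      # the same statistics are reused for test rows

print(X_tr_scaled.shape, X_te_scaled.shape)
```

Wrapping the preprocessor and the model in a `Pipeline`, as the later sections of this notebook do, gives the same guarantee automatically.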

In [196]:
# Initialize a DecisionTreeRegressor to estimate feature importances
rgr = DecisionTreeRegressor(criterion = 'squared_error', random_state = 0)

# Fit the model to the training data
rgr.fit(x_train_processed, y_train_processed)

In [197]:
# Get feature importances
importances = rgr.feature_importances_

# Create a DataFrame to view feature importances
feature_importances = pd.DataFrame({'feature': x_train_processed.columns, 'importance': importances})

# Sort by importance
feature_importances = feature_importances.sort_values(by='importance', ascending=False)

# Print features and importances
print(feature_importances)

                                 feature  importance
1                   num_vals__Attendance    0.391719
0                num_vals__Hours_Studied    0.243539
3              num_vals__Previous_Scores    0.078951
4            num_vals__Tutoring_Sessions    0.040355
13        ord_vals__Parental_Involvement    0.034599
5            num_vals__Physical_Activity    0.029081
14         ord_vals__Access_to_Resources    0.028099
2                  num_vals__Sleep_Hours    0.021799
18    ord_vals__Parental_Education_Level    0.019748
6           cat_vals__School_Type_Public    0.019296
16               ord_vals__Family_Income    0.015540
8      cat_vals__Peer_Influence_Positive    0.012990
12                      bin_vals__Gender    0.012652
17             ord_vals__Teacher_Quality    0.012360
15            ord_vals__Motivation_Level    0.011327
19          ord_vals__Distance_from_Home    0.010273
11       bin_vals__Learning_Disabilities    0.008333
10             bin_vals__Internet_Access    0.

As shown above, the numerical columns have higher importance than the categorical columns. Attendance has the highest importance of all features, followed by hours studied and previous scores.
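
To put a number on the comparison between column groups, the importances can be summed by the transformer prefix that `ColumnTransformer` adds to each feature name. The sketch below uses the values from the printed output above; the final row is truncated in that output, so it is left out and the totals are slightly incomplete.

```python
import pandas as pd

# Importances copied from the printed output (truncated final row omitted)
printed = pd.DataFrame({
    'feature': [
        'num_vals__Attendance', 'num_vals__Hours_Studied', 'num_vals__Previous_Scores',
        'num_vals__Tutoring_Sessions', 'ord_vals__Parental_Involvement',
        'num_vals__Physical_Activity', 'ord_vals__Access_to_Resources',
        'num_vals__Sleep_Hours', 'ord_vals__Parental_Education_Level',
        'cat_vals__School_Type_Public', 'ord_vals__Family_Income',
        'cat_vals__Peer_Influence_Positive', 'bin_vals__Gender',
        'ord_vals__Teacher_Quality', 'ord_vals__Motivation_Level',
        'ord_vals__Distance_from_Home', 'bin_vals__Learning_Disabilities',
    ],
    'importance': [
        0.391719, 0.243539, 0.078951, 0.040355, 0.034599, 0.029081, 0.028099,
        0.021799, 0.019748, 0.019296, 0.015540, 0.012990, 0.012652, 0.012360,
        0.011327, 0.010273, 0.008333,
    ],
})

# The text before the double underscore is the transformer (column group) name
printed['group'] = printed['feature'].str.split('__').str[0]
group_totals = printed.groupby('group')['importance'].sum().sort_values(ascending=False)

print(group_totals)
```

Impurity-based importances from a single unpruned tree can be unstable, so these totals are best read as a rough ranking rather than a precise measurement.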

## Model Selection and Evaluation

We will test 3 different models (Decision Tree Regressor, Random Forest Regressor and Gradient Boosting Regressor) to determine which performs best.

In [198]:
new_features = df.drop(columns = ['Exam_Score']).columns

X = df[new_features]
y = df[target]

num_cols = ['Hours_Studied', 'Attendance', 'Sleep_Hours', 'Previous_Scores', 'Tutoring_Sessions', 'Physical_Activity']
cat_cols = ['School_Type', 'Peer_Influence']
ord_cols = ['Parental_Involvement', 'Access_to_Resources', 'Motivation_Level', 'Family_Income', 'Teacher_Quality', 'Parental_Education_Level',
             'Distance_from_Home']
bin_cols = ['Extracurricular_Activities', 'Internet_Access', 'Learning_Disabilities', 'Gender']

ordinal_enc = OrdinalEncoder(categories=[quality_order, quality_order, quality_order, quality_order, quality_order, education_order, 
                                        distance_order])

preprocessor = ColumnTransformer(
    transformers = [
        ('num_vals', StandardScaler(), num_cols),
        # sparse_output replaces the older 'sparse' argument (scikit-learn >= 1.2)
        ('cat_vals', OneHotEncoder(sparse_output = False, drop = 'first'), cat_cols),
        ('bin_vals', 'passthrough', bin_cols),
        ('ord_vals', ordinal_enc, ord_cols)
    ]
)

# Split the data into train and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

### Decision Tree Regressor 

In [199]:
# Initialising DecisionTreeRegressor
dtr = DecisionTreeRegressor(random_state = 0)

pipeline_dtr = Pipeline([('preprocessing', preprocessor), ('regressor', dtr)])

In [200]:
# Fit the pipeline on the training data and predict on test set
pipeline_dtr.fit(x_train, y_train)

# Predicting y values from test data
y_pred = pipeline_dtr.predict(x_test)

# Pipeline score
train_score = pipeline_dtr.score(x_train, y_train)
test_score = pipeline_dtr.score(x_test, y_test)
print(f'Decision Tree Regressor Train Score: {train_score}')
print(f'Decision Tree Regressor Test Score: {test_score}')

Decision Tree Regressor Train Score: 1.0
Decision Tree Regressor Test Score: 0.16773277222514482
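
For a regressor, the pipeline's `score` method returns the coefficient of determination, R². The sketch below computes R² and RMSE by hand and checks them against scikit-learn. The true values are the first five exam scores shown in `df.head()`, while the predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([67, 61, 74, 71, 70])            # first five Exam_Score values
y_hat = np.array([66.0, 63.0, 72.0, 70.0, 71.0])   # made-up predictions

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(r2_manual, r2_score(y_true, y_hat))
print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat)))
```

An R² of 1.0 on the training data, as the decision tree reports above, means the residual sum of squares is zero: the tree has memorised every training row.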


### Random Forest Regressor

In [201]:
# Initializing RandomForestRegressor
rfr = RandomForestRegressor(random_state = 0)

pipeline_rfr = Pipeline([('preprocessing', preprocessor), ('regressor', rfr)])

In [202]:
# Fit the pipeline on the training data and predict on test set
pipeline_rfr.fit(x_train, y_train.values.ravel())

# Predicting y values from test data
y_pred = pipeline_rfr.predict(x_test)

#Pipeline score
train_score = pipeline_rfr.score(x_train, y_train)
test_score = pipeline_rfr.score(x_test, y_test)
print(f'Random Forest Regressor Train Score: {train_score}')
print(f'Random Forest Regressor Test Score: {test_score}')

Random Forest Regressor Train Score: 0.9489910246121348
Random Forest Regressor Test Score: 0.6050183903208839


### Gradient Boosting Regressor

In [203]:
# Initializing GradientBoostingRegressor
gbr = GradientBoostingRegressor(random_state = 0)

pipeline_gbr = Pipeline([('preprocessing', preprocessor), ('regressor', gbr)])

In [204]:
# Fit the pipeline on the training data and predict on test set
pipeline_gbr.fit(x_train, y_train.values.ravel())

# Predicting y values from test data
y_pred = pipeline_gbr.predict(x_test)

#Pipeline score
train_score = pipeline_gbr.score(x_train, y_train)
test_score = pipeline_gbr.score(x_test, y_test)
print(f'Gradient Boosting Regressor Train Score: {train_score}')
print(f'Gradient Boosting Regressor Test Score: {test_score}')

Gradient Boosting Regressor Train Score: 0.7597874323149514
Gradient Boosting Regressor Test Score: 0.6501478623055192


The Decision Tree Regressor (DTR) overfits far more (perfect training score, low test score) than the Random Forest Regressor (RFR) and the Gradient Boosting Regressor (GBR).
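
The size of the overfitting can be read off as the gap between training and test R². A small sketch using the scores printed above, rounded to four decimals:

```python
# (train R^2, test R^2) copied from the outputs above, rounded to four decimals
scores = {
    'Decision Tree': (1.0, 0.1677),
    'Random Forest': (0.9490, 0.6050),
    'Gradient Boosting': (0.7598, 0.6501),
}

# A larger gap means the model fits the training data much better than unseen data
gaps = {name: round(train - test, 4) for name, (train, test) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f'{name}: train-test gap = {gap}')
```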

## Hyperparameter Tuning

We chose to tune the Gradient Boosting Regressor because it has the highest test score and the smallest gap between training and test scores. We will use 5-fold cross-validation and the average root mean squared error (RMSE) to evaluate the performance of our model.

In [205]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline_gbr, x_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')

# Convert the negative MSE to positive and take the square root
rmse_scores = (-cv_scores) ** 0.5

# Display the RMSE for each fold and the average RMSE
print("RMSE for each fold: ", rmse_scores)
print("Average RMSE: ", rmse_scores.mean())

RMSE for each fold:  [1.52405975 2.34222442 1.88654995 1.47311694 2.87339772]
Average RMSE:  2.019869754618342


With RMSE values ranging from approximately 1.47 to 2.87, the model's performance varies noticeably depending on the subset of data it is trained and tested on. This suggests that the model is sensitive to the data split, which could indicate that it is not fully capturing the underlying patterns or that there are outliers or noise in the data.
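
The spread across folds can be summarised with a few numbers. The sketch below uses the fold RMSE values printed above and reports their mean, sample standard deviation and range.

```python
import numpy as np

# Fold RMSE values copied from the output above
rmse_scores = np.array([1.52405975, 2.34222442, 1.88654995, 1.47311694, 2.87339772])

mean_rmse = rmse_scores.mean()
std_rmse = rmse_scores.std(ddof=1)                  # sample standard deviation
rmse_range = rmse_scores.max() - rmse_scores.min()  # worst fold minus best fold

print(f'mean={mean_rmse:.3f}, std={std_rmse:.3f}, range={rmse_range:.3f}')
```

With `cv=5`, `cross_val_score` splits a regression dataset into consecutive, unshuffled folds. One option that could be tried, as an assumption not tested here, is passing `KFold(n_splits=5, shuffle=True, random_state=0)` as `cv` to see whether the spread comes from row order or from a few hard-to-predict students.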

In [206]:
# Initialising GBR with tuned hyperparameters
gbr = GradientBoostingRegressor(random_state = 0, n_estimators = 160, learning_rate = 0.2, max_depth = 2, min_samples_split = 10,
                               min_samples_leaf = 40, subsample = 0.8)

# Fitting regressor to training data
pipeline_gbr_tuned = Pipeline([('preprocessing', preprocessor), ('regressor', gbr)])
pipeline_gbr_tuned.fit(x_train, y_train.values.ravel())

# Predicting y values from test data
y_pred = pipeline_gbr_tuned.predict(x_test)
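
The notebook does not show how the hyperparameter values above were chosen. One way such values could be searched for is `RandomizedSearchCV`, which is already imported at the top. The following is a minimal sketch under stated assumptions: it uses synthetic data from `make_regression`, a plain `StandardScaler` in place of the full preprocessor, and a small illustrative parameter grid, so its results say nothing about the student dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the student dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

demo_pipeline = Pipeline([
    ('preprocessing', StandardScaler()),
    ('regressor', GradientBoostingRegressor(random_state=0)),
])

# Pipeline parameters are addressed as '<step name>__<parameter name>'
param_distributions = {
    'regressor__n_estimators': [40, 80, 160],
    'regressor__learning_rate': [0.05, 0.1, 0.2],
    'regressor__max_depth': [2, 3],
    'regressor__subsample': [0.8, 1.0],
}

# Try 5 random combinations, each scored by 3-fold cross-validated RMSE
search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions,
    n_iter=5,
    cv=3,
    scoring='neg_root_mean_squared_error',
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('Best CV RMSE:', -search.best_score_)
```

Because the search runs on the whole pipeline, the preprocessing is refit inside every fold, which keeps the cross-validated score free of leakage from the held-out rows.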

In [207]:
# Perform 5-fold cross-validation
cv_scores = cross_val_score(pipeline_gbr_tuned, x_train, y_train.values.ravel(), cv=5, scoring='neg_mean_squared_error')

# Convert the negative MSE to positive and take the square root
rmse_scores = (-cv_scores) ** 0.5

#Pipeline score
train_score = pipeline_gbr_tuned.score(x_train, y_train)
test_score = pipeline_gbr_tuned.score(x_test, y_test)
print(f'Gradient Boosting Regressor Train Score: {train_score}')
print(f'Gradient Boosting Regressor Test Score: {test_score}')

# Display the RMSE for each fold and the average RMSE
print("RMSE for each fold: ", rmse_scores)
print("Average RMSE: ", rmse_scores.mean())

Gradient Boosting Regressor Train Score: 0.7610921788906002
Gradient Boosting Regressor Test Score: 0.6710664100056156
RMSE for each fold:  [1.46197503 2.25354224 1.80597824 1.37240294 2.80707325]
Average RMSE:  1.940194341138131


After tuning some of GBR's parameters, we improved the training and testing scores of our model by a small margin (test score from approximately 0.65 to 0.67). We also reduced the average RMSE from approximately 2.02 to 1.94.

## Conclusion

The purpose of this project was to create a machine learning model that could accurately predict the exam scores of students based on numerous factors such as hours studied, parental involvement, extracurricular activities and many more. After experimenting with different models, we came to the conclusion that a Gradient Boosting Regressor would perform best. We tuned its parameters and created a model that explains 67% of the variance in unseen data and has an average cross-validated RMSE of 1.94.