
# Student Performance Analytics: A Data Science Project

This project focuses on predicting student academic performance. We will investigate how student data, including demographic, social, and school-related attributes, along with early academic performance indicators, can be leveraged to identify students at risk of academic failure, understand drivers of course engagement, and facilitate personalized learning interventions.

## Project Overview

Student performance is a critical indicator of educational system effectiveness. Traditional assessment methods often lack the holistic view or early warning signs needed to address potential issues proactively. This project utilizes data-driven approaches to provide insights and predictive capabilities for educational improvement.



## Data Sourcing and Description

This project utilizes the **Student Performance Data Set** from the UCI Machine Learning Repository [1]. This dataset contains student academic performance data from two Portuguese secondary schools, covering two distinct subjects: Mathematics and Portuguese language.

**Dataset Source:** [UCI Machine Learning Repository - Student Performance](https://archive.ics.uci.edu/dataset/320/student+performance)

### Dataset Files

*   `student-mat.csv`: Contains data for students in a Mathematics course.
*   `student-por.csv`: Contains data for students in a Portuguese language course.

### Features

The dataset includes 33 attributes for each student, categorized as follows:

*   **Demographic Information:** `sex`, `age`, `address`, `famsize`, `Pstatus`, `Medu` (mother's education), `Fedu` (father's education).
*   **Social and Personal Factors:** `Mjob` (mother's job), `Fjob` (father's job), `reason` (reason to choose school), `guardian`, `traveltime`, `studytime`, `failures`, `schoolsup` (extra educational support), `famsup` (family educational support), `paid` (extra paid classes), `activities` (extra-curricular activities), `nursery` (attended nursery school), `higher` (wants to take higher education), `internet` (Internet access), `romantic` (with a romantic relationship), `famrel` (quality of family relationships), `freetime`, `goout` (going out with friends), `Dalc` (workday alcohol consumption), `Walc` (weekend alcohol consumption), `health` (current health status), `absences`.
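*   **School-related Information:** `school` (student's school). The subject (Mathematics or Portuguese) is not a column in the raw files; it is implied by which file a record comes from, and a `subject` column is added during preprocessing.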
*   **Grades:** `G1` (first period grade), `G2` (second period grade), `G3` (final grade).

**Note:** The target variable for prediction in this project is `G3`.
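
Before any processing, it can help to confirm that a loaded file actually contains the expected attributes. The following is a minimal sketch, assuming the 33 raw column names listed above; `missing_columns` is a hypothetical helper and the two-row sample is made up for illustration, not taken from the dataset.

```python
import io

import pandas as pd

# The 33 raw attribute names from the Features section above.
EXPECTED_COLUMNS = [
    "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu",
    "Mjob", "Fjob", "reason", "guardian", "traveltime", "studytime",
    "failures", "schoolsup", "famsup", "paid", "activities", "nursery",
    "higher", "internet", "romantic", "famrel", "freetime", "goout",
    "Dalc", "Walc", "health", "absences", "G1", "G2", "G3",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from df (hypothetical helper)."""
    return [name for name in expected if name not in df.columns]


# Made-up sample in the same semicolon-delimited layout as the real files.
sample = io.StringIO("school;sex;age;G1;G2;G3\nGP;F;18;5;6;6\nMS;M;17;12;13;14\n")
sample_df = pd.read_csv(sample, delimiter=";")

missing = missing_columns(sample_df)
print(f"{len(EXPECTED_COLUMNS)} expected columns, {len(missing)} missing from the sample")
```

Running the same check on `student-mat.csv` and `student-por.csv` should return an empty list if the files match the description above.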



## Data Cleaning and Preprocessing

We will clean and preprocess the raw datasets, preparing them for exploratory data analysis and machine learning model training. The goal is to combine the two subject-specific datasets, handle categorical variables, and ensure data consistency.
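
To make the categorical handling concrete, here is a small sketch on a made-up three-row frame (not real data). It assumes an explicit `{"F": 0, "M": 1}` mapping for the binary column, which is an illustrative choice rather than the mapping the notebook ends up with, and uses `pd.get_dummies` with `drop_first=True` for the nominal column.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "sex": ["F", "M", "F"],
    "Mjob": ["teacher", "health", "other"],
})

# Binary column: an explicit mapping keeps the meaning of 0 and 1 stable across runs.
toy["sex"] = toy["sex"].map({"F": 0, "M": 1})

# Nominal column: one-hot encode. drop_first=True drops the first category in
# sorted order ("health" here), so the remaining dummy columns are not redundant.
encoded = pd.get_dummies(toy, columns=["Mjob"], drop_first=True)
print(list(encoded.columns))
```

The dropped category becomes the implicit reference level: a row with zeros in every `Mjob_*` column belongs to it.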


In [None]:

import pandas as pd

# --- Loading the Data ---
print("Loading student-mat.csv...")
mat_df = pd.read_csv('student-mat.csv', delimiter=';')
print("Loading student-por.csv...")
por_df = pd.read_csv('student-por.csv', delimiter=';')

# Display the data before processing
print('Displaying the Mathematics Dataset')
print(mat_df)
print('Displaying the Portuguese Dataset')
print(por_df)

# --- Add Subject Column and Combine DataFrames ---
# We add a new 'subject' column to differentiate between Mathematics and Portuguese students.
# The two dataframes will then be concatenated to create a single, comprehensive dataset.
print("Adding 'subject' column and combining dataframes...")
mat_df['subject'] = 'Math'
por_df['subject'] = 'Portuguese'
combined_df = pd.concat([mat_df, por_df], ignore_index=True)

# --- Data Cleaning: Strip Whitespace from String Columns ---
# This step ensures consistency by removing leading/trailing whitespace from all string-type columns.
print("Cleaning string columns (stripping whitespace)...")
for col in combined_df.select_dtypes(include=['object']).columns:
    combined_df[col] = combined_df[col].str.strip()

# --- Save Combined Data ---
# The combined and cleaned dataset is saved to a new CSV file for future use.
print("Saving combined_student_performance.csv...")
combined_df.to_csv('combined_student_performance.csv', index=False)
print("combined_student_performance.csv created.")
print("Shape of combined_df:", combined_df.shape)
print("First 5 rows of combined_df:")
print(combined_df.head())


In [None]:

# --- Feature Engineering and Encoding ---
# We will handle categorical variables by converting them into a numerical format
# suitable for machine learning models. Binary categorical variables are mapped to 0/1,
# and multi-category nominal variables are one-hot encoded.

print("Starting feature engineering and encoding...")

# Identify binary and multi-category nominal columns
binary_cols = ['school', 'sex', 'address', 'famsize', 'Pstatus', 'schoolsup', 'famsup', 'paid', 'activities', 'nursery', 'higher', 'internet', 'romantic', 'subject']
nominal_cols = ['Mjob', 'Fjob', 'reason', 'guardian']

# Apply one-hot encoding to nominal columns
print("Applying one-hot encoding to nominal columns...")
combined_df_encoded = pd.get_dummies(combined_df, columns=nominal_cols, drop_first=True)

# Convert binary columns to numerical (0 and 1)
print("Converting binary columns to numerical (0 and 1)...")
for col in binary_cols:
    if col in combined_df_encoded.columns:
        # Sort the values so the 0/1 assignment is deterministic (alphabetical)
        # instead of depending on which value happens to appear first in the data.
        unique_values = sorted(combined_df_encoded[col].dropna().unique())
        if len(unique_values) == 2:
            mapping = {unique_values[0]: 0, unique_values[1]: 1}
            combined_df_encoded[col] = combined_df_encoded[col].map(mapping)
            print(f"  {col}: {mapping}")
        else:
            print(f"Warning: Column {col} does not have exactly 2 unique values. Skipping 0/1 mapping.")

# --- Save Preprocessed Data ---
# We will save the fully preprocessed dataset to a new CSV file.
print("Saving preprocessed_student_performance.csv...")
combined_df_encoded.to_csv('preprocessed_student_performance.csv', index=False)
print("preprocessed_student_performance.csv created.")
print("Shape of preprocessed_student_performance.csv:", combined_df_encoded.shape)
print("First 5 rows of preprocessed_student_performance.csv:")
print(combined_df_encoded.head())



## Exploratory Data Analysis (EDA)

We will perform Exploratory Data Analysis (EDA) to understand the dataset's characteristics, identify patterns, and visualize relationships between variables. This will help us in gaining insights into student performance and potential influencing factors.
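
One quick numeric complement to the plots is to rank features by their linear correlation with `G3`. The sketch below uses a made-up five-row frame purely for illustration; the column names follow the dataset, but the values are invented and say nothing about the real data.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2],
    "failures": [3, 1, 0, 0, 2],
    "G3": [6, 11, 14, 16, 9],
})

# Pearson correlation of every numeric column with the target, target itself removed.
corr_with_target = toy.corr(numeric_only=True)["G3"].drop("G3").sort_values()
print(corr_with_target)
```

Correlation only captures linear association, so a weak value here does not rule out a non-linear relationship with the final grade.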


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# Load the preprocessed data
print("Loading preprocessed_student_performance.csv for EDA...")
combined_df = pd.read_csv("preprocessed_student_performance.csv")

# --- Descriptive Statistics ---
# Display descriptive statistics for numerical columns and value counts for key categorical columns.
print("Descriptive statistics for numerical features:")
print(combined_df.describe())

print("Value counts for key categorical features (after encoding, some might be 0/1):")
print("School:", combined_df['school'].value_counts())
print("Sex:", combined_df['sex'].value_counts())
print("Address:", combined_df['address'].value_counts())
print("Subject:", combined_df['subject'].value_counts())


In [None]:

# --- Visualization 1: Distribution of Final Grades (G3) ---
# This histogram will visualize the distribution of the final grade (G3).
# It will helps us to understand the spread and common grade ranges among students.
print("Generating G3 distribution histogram...")
plt.figure(figsize=(10, 6))
sns.histplot(combined_df['G3'], kde=True, bins=20)
plt.title('Distribution of Final Grades (G3)')
plt.xlabel('Final Grade (G3)')
plt.ylabel('Number of Students')
plt.grid(axis='y', alpha=0.75)
plt.savefig('g3_distribution.png')
plt.show()
print("g3_distribution.png created.")


In [None]:

# --- Visualization 2: Correlation Heatmap of Numerical Features ---
# This heatmap shows the correlation matrix for all numerical features.
# It helps identify linear relationships between variables, especially with the target variable G3.
print("Generating correlation heatmap...")
plt.figure(figsize=(14, 10))
# fmt only affects cell annotations, so it is omitted while annot=False.
sns.heatmap(combined_df.corr(numeric_only=True), annot=False, cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.savefig('correlation_heatmap.png')
plt.show()
print("correlation_heatmap.png created.")


In [None]:

# --- Visualization 3: Box Plots of G3 by Categorical Features ---
# These box plots compare the distribution of final grades (G3) across the categories
# of key binary features. Because the preprocessed file is loaded, each category appears
# as its 0/1 code on the x-axis. This shows how these factors relate to academic performance.
print("Generating box plots for G3 by categorical features...")
plt.figure(figsize=(18, 12))

plt.subplot(2, 3, 1)
sns.boxplot(x='school', y='G3', data=combined_df)
plt.title('G3 by School')

plt.subplot(2, 3, 2)
sns.boxplot(x='sex', y='G3', data=combined_df)
plt.title('G3 by Sex')

plt.subplot(2, 3, 3)
sns.boxplot(x='address', y='G3', data=combined_df)
plt.title('G3 by Address')

plt.subplot(2, 3, 4)
sns.boxplot(x='famsize', y='G3', data=combined_df)
plt.title('G3 by Family Size')

plt.subplot(2, 3, 5)
sns.boxplot(x='Pstatus', y='G3', data=combined_df)
plt.title('G3 by Parents Cohabitation Status')

plt.subplot(2, 3, 6)
sns.boxplot(x='subject', y='G3', data=combined_df)
plt.title('G3 by Subject')

plt.tight_layout()
plt.savefig('categorical_g3_boxplots.png')
plt.show()
print("categorical_g3_boxplots.png created.")



## Methodology / Implementation

### 1. Feature and Target Definition

*   **Target Variable (y):** The final grade (`G3`) is the target variable for prediction.
*   **Feature Variables (X):** All other preprocessed columns are used as features, except `G1` (first period grade) and `G2` (second period grade), which are deliberately *excluded*. As the dataset creators point out, predicting `G3` without `G1` and `G2` is more difficult but considerably more useful for early intervention, because a prediction is then possible before any period grade is available.

### 2. Data Splitting

The preprocessed dataset will be split into training and testing sets (80% for training, 20% for testing) using `random_state=42` for reproducibility.
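
The feature/target definition and the 80/20 split can be sketched on a small made-up frame as follows. The ten rows are invented for illustration; only the column names and the `test_size` and `random_state` values come from the description above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented rows for illustration only.
toy = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2, 1, 3, 2, 4, 1],
    "failures": [3, 1, 0, 0, 2, 1, 0, 1, 0, 2],
    "G1": [5, 10, 13, 15, 8, 9, 14, 11, 16, 7],
    "G2": [6, 11, 13, 16, 9, 9, 15, 11, 17, 6],
    "G3": [6, 11, 14, 16, 9, 10, 15, 12, 17, 7],
})

# The target is G3; the earlier period grades are dropped from the features.
X = toy.drop(columns=["G1", "G2", "G3"])
y = toy["G3"]

# A fixed random_state makes the same rows land in the test set on every run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```

With ten rows, `test_size=0.2` places two rows in the test set and eight in the training set.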

### 3. Model Selection and Training

Three regression models are used, chosen because they capture relationships in the data in different ways:

*   **Linear Regression:** A baseline model that captures only linear relationships between the features and the target.
*   **Random Forest Regressor:** An ensemble that averages many decision trees, which makes it less prone to overfitting than a single tree and able to model non-linear relationships.
*   **Gradient Boosting Regressor:** An ensemble that builds trees sequentially, each one fit to the errors of the trees before it, and that often achieves high predictive accuracy on tabular data.
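
As a rough sketch of how the three model families behave under the same interface, the snippet below fits them on synthetic data from `make_regression`. The data is a stand-in chosen only so the example runs on its own; the scores it prints say nothing about the student dataset.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the student dataset).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    # For regressors, score() returns R-squared on the data passed in.
    scores[name] = model.score(X_test, y_test)

for name, r2 in scores.items():
    print(f"{name}: R2 = {r2:.3f}")
```

Because `make_regression` generates a linear signal, the linear baseline is favoured on this synthetic data; on real data the ranking may well differ.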

### 4. Model Persistence

Each trained model will be saved using the `joblib` library for later use and evaluation.
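
A minimal round trip with `joblib.dump` and `joblib.load` looks like the sketch below. It writes to a temporary directory and uses a tiny made-up dataset so it leaves no files behind; the file name mirrors the naming used in this project but is otherwise arbitrary.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up dataset: y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "linear_regression.pkl")
    joblib.dump(model, path)      # serialize the fitted estimator to disk
    restored = joblib.load(path)  # read it back as a new object

# The restored model should reproduce the original predictions.
same = bool(np.allclose(model.predict(X), restored.predict(X)))
print("Predictions identical after reload:", same)
```

Saved models should be loaded with the same scikit-learn version that produced them, since pickled estimators are not guaranteed to be compatible across versions.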



## Model Building and Training

We now prepare the data for modeling: split it into training and testing sets, train the selected regression models, and save the trained models.


In [None]:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
import joblib

# Load the preprocessed data
print("Loading preprocessed_student_performance.csv for model building...")
combined_df = pd.read_csv("preprocessed_student_performance.csv")

# --- Define Features (X) and Target (y) ---
# G1 and G2 will be excluded from features to make the prediction more challenging and useful for early intervention.
print("Defining features (X) and target (y). G1 and G2 are excluded from features.")
X_all = combined_df.drop(["G1", "G2", "G3"], axis=1)
y = combined_df["G3"]

# Convert boolean columns to int for model compatibility if any exist
# This step ensures that any remaining boolean columns (from one-hot encoding) are numerical.
print("Converting boolean columns to int for model compatibility...")
for col in X_all.select_dtypes(include=["bool"]).columns:
    X_all[col] = X_all[col].astype(int)

# --- Split Data into Training and Testing Sets ---
# The data is split into 80% for training and 20% for testing to evaluate model performance on unseen data.
print("Splitting data into training (80%) and testing (20%) sets...")
X_train, X_test, y_train, y_test = train_test_split(X_all, y, test_size=0.2, random_state=42)

print("Shape of X_train:", X_train.shape)
print("Shape of X_test:", X_test.shape)
print("Shape of y_train:", y_train.shape)
print("Shape of y_test:", y_test.shape)

# --- Save Test Data ---
# The test data (features and target) will be saved to CSV files. These will be used later for model evaluation.
print("Saving X_test.csv and y_test.csv...")
X_test.to_csv("X_test.csv", index=False)
y_test.to_csv("y_test.csv", index=False)
print("X_test.csv and y_test.csv created.")

# --- Initialize and Train Models ---
# Three regression models will be initialized and trained on the training data.
# Each trained model will then be saved using joblib for persistence.
models = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(random_state=42),
    "Gradient Boosting Regressor": GradientBoostingRegressor(random_state=42)
}

for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    print(f"{name} trained.")
    # Save the trained model to a .pkl file
    joblib.dump(model, f"{name.replace(' ', '_').lower()}.pkl")
    print(f"Saved {name} model to {name.replace(' ', '_').lower()}.pkl")



## Statistical Analysis and Evaluation

This section focuses on evaluating the performance of the trained machine learning models using standard regression metrics. It also discusses the assumptions made and the limitations of the current approach.
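
The four metrics used here can be computed by hand to see what each one measures. The sketch below uses four made-up grade pairs and checks the manual formulas against scikit-learn's implementations.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted grades for illustration.
y_true = np.array([10.0, 12.0, 14.0, 8.0])
y_pred = np.array([11.0, 12.0, 13.0, 10.0])

residuals = y_true - y_pred

mae = np.mean(np.abs(residuals))   # average absolute error, in grade points
mse = np.mean(residuals ** 2)      # average squared error, penalizes large misses
rmse = np.sqrt(mse)                # back in grade points
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot         # share of variance explained

print(f"MAE={mae:.2f} MSE={mse:.2f} RMSE={rmse:.2f} R2={r2:.2f}")

# The manual values agree with scikit-learn.
assert np.isclose(mae, mean_absolute_error(y_true, y_pred))
assert np.isclose(mse, mean_squared_error(y_true, y_pred))
assert np.isclose(r2, r2_score(y_true, y_pred))
```

MAE and RMSE are in the same units as the grade, which makes them easier to interpret than MSE; R-squared compares the model against always predicting the mean grade.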


In [None]:

import pandas as pd
import joblib
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# --- Load Test Data ---
# The previously saved test data is loaded to evaluate the trained models.
print("Loading X_test.csv and y_test.csv for evaluation...")
X_test = pd.read_csv("X_test.csv")
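# squeeze("columns") turns the single-column DataFrame back into a Series,
# so y_test has the same 1-D shape as the model predictions.
y_test = pd.read_csv("y_test.csv").squeeze("columns")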

# Convert boolean columns to int for model compatibility if any exist
# This ensures consistency with the format used during model training.
print("Converting boolean columns in X_test to int...")
for col in X_test.select_dtypes(include=["bool"]).columns:
    X_test[col] = X_test[col].astype(int)

# --- Load Trained Models ---
# The trained models (saved as .pkl files) are loaded for prediction.
print("Loading trained models...")
models = {
    "Linear Regression": joblib.load("linear_regression.pkl"),
    "Random Forest Regressor": joblib.load("random_forest_regressor.pkl"),
    "Gradient Boosting Regressor": joblib.load("gradient_boosting_regressor.pkl")
}

# --- Model Evaluation ---
# Each model's performance is evaluated using MAE, MSE, RMSE, and R-squared.
print("Model Evaluation Results:")
results = {}
for name, model in models.items():
    print(f"Evaluating {name}...")
    y_pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse**0.5 # Root Mean Squared Error
    r2 = r2_score(y_test, y_pred)
    
    results[name] = {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}
    
    print(f"--- {name} ---")
    print(f"Mean Absolute Error (MAE): {mae:.2f}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"R-squared (R2): {r2:.2f}")

# --- Save Evaluation Results ---
# The evaluation results are compiled into a DataFrame and saved to a CSV file.
results_df = pd.DataFrame(results).T
print("Summary of Model Evaluation:")
print(results_df)
results_df.to_csv("model_evaluation_results.csv")
print("Model evaluation results saved to model_evaluation_results.csv.")



## Conclusion and References

### Conclusion

This research project successfully demonstrated the application of data science techniques to predict student academic performance and identify influential factors within secondary education. The Gradient Boosting Regressor emerged as the most effective model, providing a valuable tool for forecasting student final grades even with the intentional exclusion of highly correlated prior grades (G1 and G2), thereby enhancing its utility for early intervention strategies.

The findings from the exploratory data analysis and model evaluation underscore the complex interplay of demographic, social, and school-related factors in shaping student outcomes. Key insights revealed the significant impact of parental education, study time, and past academic failures on final grades. The proposed applications (an interactive dashboard, an early warning system API, and a personalized learning recommendations engine) illustrate how these analytical insights could be translated into practical, actionable tools.

### References

1.  Cortez, P., & Silva, A. M. G. (2008). Using Data Mining to Predict Secondary School Student Performance. In A. Brito & J. Teixeira (Eds.), *Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008)* (pp. 5-12). Porto, Portugal: EUROSIS. ISBN 978-9077381-39-7.
2.  UCI Machine Learning Repository: Student Performance. (n.d.). Retrieved from [https://archive.ics.uci.edu/dataset/320/student+performance](https://archive.ics.uci.edu/dataset/320/student+performance)
