# Student Performance Prediction

This notebook predicts a student's performance index from hours studied, previous scores, extracurricular activities, sleep hours, and sample question papers practiced. The goal is a model whose predictions are realistic ("humanised", meaning whole-number scores within the valid range) and free of systematic bias.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Set style for plots
sns.set_theme(style="whitegrid")

## 1. Load and Inspect Data

In [None]:
# Load the dataset
file_path = 'student_ml/performance/dataset.csv'
df = pd.read_csv(file_path)

# Display first few rows
df.head()

In [None]:
# Check data types and missing values
df.info()
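
`df.info()` reports non-null counts, but explicit counts of missing values and duplicate rows are often easier to scan. The sketch below uses a small made-up frame (`toy` is a stand-in, not the real dataset) to show the two checks; the same calls could be applied to `df`.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real dataset: one missing value, one repeated row.
toy = pd.DataFrame({
    "Hours Studied": [7, 4, np.nan, 7],
    "Sleep Hours": [8, 6, 7, 8],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", n_duplicates)
```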

## 2. Exploratory Data Analysis (EDA)

We'll visualize the distributions and potential relationships between variables, with a specific focus on "Hours Studied".

In [None]:
# Relationship between Hours Studied and Performance Index
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df, x='Hours Studied', y='Performance Index', alpha=0.6)
plt.title('Hours Studied vs. Performance Index')
plt.xlabel('Hours Studied')
plt.ylabel('Performance Index')
plt.show()

In [None]:
# Correlation Matrix
# First convert categorical 'Extracurricular Activities' to numeric for correlation
df_corr = df.copy()
df_corr['Extracurricular Activities'] = df_corr['Extracurricular Activities'].apply(lambda x: 1 if x == 'Yes' else 0)

plt.figure(figsize=(10, 8))
sns.heatmap(df_corr.corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Heatmap')
plt.show()

## 3. Preprocessing

We need to encode the categorical variable `Extracurricular Activities` and split the data into training and testing sets.

In [None]:
# Encoding Categorical Variable
df['Extracurricular Activities'] = df['Extracurricular Activities'].map({'Yes': 1, 'No': 0})

# Defining Features (X) and Target (y)
X = df.drop('Performance Index', axis=1)
y = df['Performance Index']

# Splitting the Data (80% Train, 20% Test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
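
One thing worth checking after the encoding step: `Series.map` returns `NaN` for any label that is not a key in the dictionary, and `LinearRegression` cannot be fitted on `NaN` values. The sketch below uses a made-up column (`toy`, with a deliberately inconsistent label) to show how unmapped labels could be detected and normalised; whether the real dataset contains such labels is not known here.

```python
import pandas as pd

# Made-up column with one inconsistently cased label ("yes").
toy = pd.DataFrame({
    "Extracurricular Activities": ["Yes", "No", "yes", "No"],
})

# Labels missing from the dictionary silently become NaN.
encoded = toy["Extracurricular Activities"].map({"Yes": 1, "No": 0})
n_unmapped = int(encoded.isna().sum())
print("Unmapped labels:", n_unmapped)

# Normalising whitespace and case before mapping avoids the problem.
cleaned = toy["Extracurricular Activities"].str.strip().str.capitalize()
encoded_clean = cleaned.map({"Yes": 1, "No": 0})
print(encoded_clean.tolist())
```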

## 4. Modeling (Linear Regression)

We use linear regression because its coefficients are directly interpretable: each one estimates how much the predicted score changes per unit of a factor (such as hours studied) when the other factors are held fixed.

In [None]:
# Initialize and Train Model
model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
# Coefficients Interpretation
coefficients = pd.DataFrame({'Feature': X.columns, 'Coefficient': model.coef_})
print("Intercept:", model.intercept_)
print(coefficients)
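
To make the interpretation concrete: a linear regression prediction is the intercept plus the sum of each coefficient times its feature value. The sketch below checks this on synthetic, noise-free data (`X_toy`, `y_toy`, and the features `a` and `b` are invented for illustration and are unrelated to the student dataset).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with a known relationship: y = 2*a + 3*b + 5 (no noise).
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"a": rng.uniform(0, 10, 50), "b": rng.uniform(0, 10, 50)})
y_toy = 2 * X_toy["a"] + 3 * X_toy["b"] + 5

toy_model = LinearRegression().fit(X_toy, y_toy)

# Rebuild one prediction by hand: intercept + coefficient-weighted features.
row = X_toy.iloc[[0]]
manual = float(toy_model.intercept_ + np.dot(toy_model.coef_, row.to_numpy()[0]))
from_model = float(toy_model.predict(row)[0])

print("Coefficients:", toy_model.coef_)
print("Manual vs model:", manual, from_model)
```

The same arithmetic applies to the fitted `model` above, using `model.intercept_` and `model.coef_` with the columns of `X`.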

## 5. Evaluation & Humanised Predictions

We evaluate the model using standard metrics. To ensure "humanised" predictions:
1. We clip predictions to be within the valid range [0, 100].
2. We round predictions to the nearest integer, as scores are typically whole numbers.

In [None]:
# Make Predictions
y_pred_raw = model.predict(X_test)

# Humanise Predictions: Clip to [0, 100] and Round
y_pred_humanised = np.clip(y_pred_raw, 0, 100)
y_pred_humanised = np.round(y_pred_humanised)

# Evaluation Metrics
mae = mean_absolute_error(y_test, y_pred_humanised)
mse = mean_squared_error(y_test, y_pred_humanised)
r2 = r2_score(y_test, y_pred_humanised)

print(f"Mean Absolute Error (MAE): {mae:.2f}")
print(f"Mean Squared Error (MSE): {mse:.2f}")
print(f"R-squared Score: {r2:.4f}")
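
Because the metrics above are computed on the humanised predictions, they can differ from metrics on the raw model output. The sketch below uses made-up numbers (`y_true` and `y_raw` are illustrative, not results from this notebook) to show how clipping and rounding change the mean absolute error.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up actual scores and raw predictions, including one above 100.
y_true = np.array([50.0, 75.0, 98.0, 12.0])
y_raw = np.array([50.4, 74.2, 101.3, 11.6])

# Same humanising steps as in the notebook: clip to [0, 100], then round.
y_human = np.round(np.clip(y_raw, 0, 100))

mae_raw = mean_absolute_error(y_true, y_raw)
mae_human = mean_absolute_error(y_true, y_human)

print("Humanised predictions:", y_human)
print(f"MAE (raw): {mae_raw:.3f}")
print(f"MAE (humanised): {mae_human:.3f}")
```

Reporting both variants side by side makes it clear how much of the score comes from the model and how much from the post-processing.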

## 6. Bias Check

We inspect the test-set residuals for systematic errors. For an unbiased model, the residuals should be centered on zero; a roughly symmetric, bell-shaped distribution additionally suggests that the linear model's error assumptions are reasonable.

In [None]:
residuals = y_test - y_pred_raw

plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Distribution of Residuals')
plt.xlabel('Residual (Actual - Predicted)')
plt.show()
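
A histogram can look centered overall while opposite-signed errors cancel out across subgroups. As a complement, the sketch below computes the mean residual overall and per group on made-up numbers (the `check` frame is illustrative only); with the real data, the same grouping could be applied to `X_test` and `residuals`.

```python
import pandas as pd

# Made-up actual and predicted scores plus one grouping feature.
check = pd.DataFrame({
    "actual":    [60.0, 72.0, 85.0, 40.0, 55.0, 90.0],
    "predicted": [61.0, 70.0, 86.0, 41.0, 53.0, 91.0],
    "Extracurricular Activities": [1, 0, 1, 1, 0, 1],
})
check["residual"] = check["actual"] - check["predicted"]

# Overall mean residual: close to zero suggests no global bias.
mean_residual = check["residual"].mean()

# Mean residual per group: nonzero values reveal group-specific bias.
by_group = check.groupby("Extracurricular Activities")["residual"].mean()

print("Mean residual:", mean_residual)
print(by_group)
```

In this toy example the overall mean is zero, yet one group is under-predicted and the other over-predicted, which the histogram alone would not reveal.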

## 7. Example Prediction

Let's test the model with a hypothetical student.

In [None]:
# Example: Student who studies 7 hours, previous score 80, does extracurriculars, sleeps 8 hours, practiced 5 papers.
new_student = pd.DataFrame({
    'Hours Studied': [7],
    'Previous Scores': [80],
    'Extracurricular Activities': [1], # Yes
    'Sleep Hours': [8],
    'Sample Question Papers Practiced': [5]
})

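# Reorder columns to match the training data before predicting
prediction = model.predict(new_student[X.columns])
# Clip to [0, 100], then round, as in the evaluation step
final_score = np.round(np.clip(prediction[0], 0, 100))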

print(f"Predicted Performance Index: {final_score}")
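
For repeated use, the prediction steps could be wrapped in a small function. The sketch below is one possible shape, not part of the original notebook: `predict_score` is a hypothetical helper, and the toy model and its data are invented so the example runs on its own. It reorders the input to the training column order before predicting, since scikit-learn checks feature names on models fitted with a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def predict_score(model, feature_names, student):
    """Hypothetical helper: return one clipped, rounded score from a dict."""
    # Build a one-row frame and enforce the training column order.
    row = pd.DataFrame([student])[list(feature_names)]
    raw = model.predict(row)[0]
    # Clip to [0, 100], then round to a whole-number score.
    return int(np.round(np.clip(raw, 0, 100)))

# Toy model on made-up data: score = 5 * hours + 0.5 * previous score.
train = pd.DataFrame({
    "Hours Studied": [1, 2, 3, 4, 5, 6],
    "Previous Scores": [40, 90, 55, 70, 60, 85],
})
target = 5 * train["Hours Studied"] + 0.5 * train["Previous Scores"]
toy_model = LinearRegression().fit(train, target)

# Keys are deliberately given in a different order than the training columns.
score = predict_score(
    toy_model, train.columns, {"Previous Scores": 80, "Hours Studied": 7}
)
print("Predicted score:", score)
```

With the notebook's own objects, the call would look like `predict_score(model, X.columns, {...})` with all five feature names as keys.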