# Training a Model for Student Score Prediction

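This notebook walks through a complete machine learning pipeline for predicting students' math, reading, and writing scores from five categorical features: gender, race/ethnicity, parental level of education, lunch type, and test preparation course. We'll cover data loading, preprocessing, EDA, feature engineering, model training, evaluation, and a summary of findings.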

In [None]:
import kagglehub
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sklearn as sk
from sklearn.preprocessing import LabelEncoder, StandardScaler

## Data Loading and Initial Exploration

This step involves downloading the dataset from Kaggle and loading it into a pandas DataFrame. We then perform initial exploration by displaying the first few rows to understand the structure and features of the data.

In [None]:
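import os

# Download the dataset; kagglehub returns the local directory it was saved to.
path = kagglehub.dataset_download("spscientist/students-performance-in-exams")
print("Path to dataset files:", path)

# List the downloaded files and pick the first CSV rather than assuming files[0] is one.
files = os.listdir(path)
print("Files in the dataset directory:", files)
csv_files = sorted(f for f in files if f.lower().endswith(".csv"))
csv_file_path = os.path.join(path, csv_files[0])

# Load the dataset and preview the first rows.
df = pd.read_csv(csv_file_path)
display(df.head())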

## Data Cleaning and Preprocessing

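In this step, we check for missing values and drop any rows that contain nulls, encode the five categorical columns as integers with LabelEncoder, and standardize the resulting features with StandardScaler. The three score columns serve as the prediction targets. Note that LabelEncoder assigns integer codes in sorted label order, so the codes carry no meaningful ranking.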

In [None]:
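# Check for missing values and drop incomplete rows before encoding.
print(df.isnull().sum())
df = df.dropna()

# Encode each categorical column as integers, keeping one fitted encoder per column
# so the original labels can be recovered later with inverse_transform.
categorical_cols = ['gender', 'race/ethnicity', 'parental level of education', 'lunch', 'test preparation course']
encoders = {}
for col in categorical_cols:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

display(df.head())

# Features are the encoded categorical columns; targets are the three scores.
X = df[categorical_cols]
y = df[['math score', 'reading score', 'writing score']]

# Standardize the features to zero mean and unit variance.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)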

## Exploratory Data Analysis (EDA)

EDA helps us understand the data distribution and relationships between variables. We use visualizations like histograms and correlation heatmaps to identify patterns and insights.

In [None]:
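# Distribution of math scores, with a kernel density estimate overlaid
plt.figure(figsize=(10,6))
sns.histplot(df['math score'], kde=True)
plt.title('Distribution of Math Scores')
plt.show()

# Pairwise correlations between the encoded features and the three scores
plt.figure(figsize=(10,6))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Heatmap')
plt.show()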
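
Correlations on integer-coded nominal columns can be hard to interpret, so comparing mean scores per category is a useful complement. The sketch below shows the idea on a few invented rows (`toy` and `group_means` are hypothetical names); on the notebook's `df` the group keys would be the integer codes produced during preprocessing rather than the text labels.

```python
import pandas as pd

# Toy rows invented for illustration; not taken from the Kaggle dataset.
toy = pd.DataFrame({
    "test preparation course": ["none", "completed", "none", "completed"],
    "math score": [60, 75, 55, 80],
    "reading score": [65, 78, 60, 85],
})

# Mean of each score within each category of the grouping column.
group_means = toy.groupby("test preparation course")[["math score", "reading score"]].mean()
print(group_means)
```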

## Feature Engineering

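Feature engineering creates new features or transforms existing ones to improve model performance. Here, we expand the scaled features with pairwise interaction terms (`PolynomialFeatures` with `interaction_only=True`) and add an `average score` column, the mean of the three scores. Two caveats: the training cell later in this notebook fits the model on `X_scaled`, so `X_poly` is not used unless it is passed to `train_test_split` instead, and `average score` is derived from the targets, so it must not be used as an input feature.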

In [None]:
# Feature Engineering
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, interaction_only=True)
X_poly = poly.fit_transform(X_scaled)

df['average score'] = df[['math score', 'reading score', 'writing score']].mean(axis=1)

print("Feature engineering completed.")

## Model Training and Evaluation

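We train a `RandomForestRegressor` wrapped in `MultiOutputRegressor`, which fits one forest per target score, on an 80/20 train/test split. We evaluate it with mean squared error (MSE) and R-squared, both averaged across the three targets by scikit-learn's defaults, and plot the residuals to look for systematic errors.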

In [None]:
# Model Training
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.metrics import mean_squared_error, r2_score

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)

model = MultiOutputRegressor(RandomForestRegressor(random_state=42))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

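# Visualization: residuals (actual minus predicted), one color per target
residuals = y_test.to_numpy() - y_pred
plt.figure(figsize=(10,6))
for i, target in enumerate(y.columns):
    plt.scatter(y_pred[:, i], residuals[:, i], alpha=0.6, label=target)
plt.axhline(0, color='gray', linestyle='--')
plt.xlabel('Predicted')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.legend()
plt.show()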
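
A single averaged MSE or R-squared can hide differences between the three targets. As a sketch, scikit-learn's `multioutput="raw_values"` option returns one value per target. The arrays below are invented numbers used only to show the call; the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up true and predicted scores; columns are math, reading, writing.
y_true_demo = np.array([[70, 72, 74], [55, 60, 58], [90, 88, 93], [65, 70, 68]], dtype=float)
y_pred_demo = np.array([[68, 75, 70], [60, 58, 61], [85, 90, 90], [66, 69, 72]], dtype=float)

# One metric value per target column instead of a single average.
mse_per_target = mean_squared_error(y_true_demo, y_pred_demo, multioutput="raw_values")
r2_per_target = r2_score(y_true_demo, y_pred_demo, multioutput="raw_values")

target_names = ["math score", "reading score", "writing score"]
for name, m, r in zip(target_names, mse_per_target, r2_per_target):
    print(f"{name}: MSE={m:.2f}, R2={r:.3f}")
```

In the notebook, the same calls would take `y_test` and `y_pred` in place of the demo arrays.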

## Conclusion and Key Findings

This final step summarizes the project, highlighting key insights, model performance, and potential improvements for future work.

In [None]:
# Conclusion
print("### Conclusion and Key Findings")
print("1. Data Overview: The dataset contains student performance data with features like gender, race/ethnicity, parental education, lunch type, and test preparation course.")
print("2. Preprocessing: Categorical variables were encoded, and features were scaled.")
print("3. EDA Insights: Visualizations showed distributions and correlations.")
print("4. Feature Engineering: Interaction features and an average score column were computed; the model itself was trained on the scaled base features.")
print("5. Model Performance: The model achieved MSE of", mse, "and R-squared of", r2)
print("6. Key Findings: The model provides a baseline for prediction, but further improvements are possible.")
print("7. Future Improvements: Consider more advanced models or additional features.")

# Save the model
import joblib
joblib.dump(model, 'student_score_model.pkl')
print("Model saved as 'student_score_model.pkl'")
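
The saved file contains only the model, so predictions on new data would also need the fitted scaler. One possible approach, sketched here on random placeholder data rather than the real dataset, is to bundle the scaler and model in a scikit-learn pipeline and save that instead. The file name `student_score_pipeline_demo.pkl` and the `_demo` names are hypothetical.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Random placeholder data: 5 integer-coded features and 3 score targets.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(40, 5)).astype(float)
y_demo = rng.uniform(0, 100, size=(40, 3))

# The scaler and model travel together, so new data is scaled the same way.
pipeline_demo = make_pipeline(
    StandardScaler(),
    MultiOutputRegressor(RandomForestRegressor(n_estimators=10, random_state=42)),
)
pipeline_demo.fit(X_demo, y_demo)

# Save to a temporary directory, then load it back as a separate object.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "student_score_pipeline_demo.pkl")
    joblib.dump(pipeline_demo, model_path)
    reloaded = joblib.load(model_path)

# The reloaded pipeline accepts unscaled features and returns one row of 3 scores each.
predictions = reloaded.predict(X_demo[:2])
print(predictions.shape)
```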