## Business Understanding

Educational institutions constantly seek to improve student performance and reduce failure or dropout rates. However, many schools struggle to identify students who are at risk of poor academic outcomes early enough for effective intervention. Traditional evaluation methods rely on exam results, which only reveal problems after they occur.

Data science offers a proactive solution by using student data, such as demographics, study habits, attendance, and previous grades, to predict academic performance before final exams.

#### Problem Statement

The goal of this project is to develop a **machine learning model** capable of predicting a student’s academic performance based on historical and behavioral data. By identifying students likely to perform poorly, schools can implement early interventions such as mentoring, tutoring, or parental engagement.

#### Project Objective

To analyze student data and build a predictive model that:

**1.** Classifies students into performance categories (e.g., Low, Average, High), or

**2.** Predicts their final grade (numerical value).

The insights from this model will help teachers and administrators make **data-driven decisions** to improve learning outcomes and student retention.

#### Key Questions

 ~ Which factors most influence student performance?

 ~ Can we accurately predict, before final exams, whether a student will pass or fail?

 ~ How can educators use data insights to support at-risk students?

#### Expected Outcomes

 ~ A trained machine learning model that predicts student performance.

 ~ Identification of the top factors affecting academic success.

 ~ Clear recommendations for improving student outcomes based on data analysis.

## Data Understanding
#### Dataset Source

The dataset used in this project is the **Student Performance Dataset (Synthetic, Realistic)**, which is designed for machine learning beginners.
It contains **1,000,000 rows of synthetic student data** and is available on Kaggle.

Each record represents a single student with information about their study habits, attendance, class participation, and final performance score.
The dataset is synthetic but follows realistic patterns, making it ideal for training and evaluating regression and classification models.

#### Dataset Overview

**student_id** - Unique identifier for each student.

**weekly_self_study_hours** - Average weekly self-study hours (ranging from 0 to 40).

**attendance_percentage** - Attendance percentage (between 50 and 100).

**class_participation** - Level of participation in class activities (score between 0 and 10).

**total_score** - Final performance score (0 to 100). This is a continuous value used for regression.

**grade** - Final letter grade (A, B, C, D, F) derived from total_score. Used for classification.
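
The documented ranges above can be checked against the data before any modeling. The following is a minimal sketch, not part of the original notebook: `EXPECTED_RANGES`, `VALID_GRADES`, and `find_range_violations` are hypothetical names, and the small `sample` frame (the first five rows of the dataset, as displayed by `data.head()`) stands in for the full `data` DataFrame.

```python
import pandas as pd

# Stand-in sample: the first five rows of the dataset.
# In the notebook you would pass the full `data` DataFrame instead.
sample = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "weekly_self_study_hours": [18.5, 14.0, 19.5, 25.7, 13.4],
    "attendance_percentage": [95.6, 80.0, 86.3, 70.2, 81.9],
    "class_participation": [3.8, 2.5, 5.3, 7.0, 6.9],
    "total_score": [97.9, 83.9, 100.0, 100.0, 92.0],
    "grade": ["A", "B", "A", "A", "A"],
})

# Documented inclusive ranges, taken from the dataset overview.
EXPECTED_RANGES = {
    "weekly_self_study_hours": (0, 40),
    "attendance_percentage": (50, 100),
    "class_participation": (0, 10),
    "total_score": (0, 100),
}
VALID_GRADES = {"A", "B", "C", "D", "F"}


def find_range_violations(df):
    """Return a dict mapping each column to its number of invalid rows."""
    violations = {}
    for column, (low, high) in EXPECTED_RANGES.items():
        # between() is inclusive on both ends; NaN counts as a violation.
        violations[column] = int((~df[column].between(low, high)).sum())
    # Grades outside the documented letter set.
    violations["grade"] = int((~df["grade"].isin(VALID_GRADES)).sum())
    # student_id is described as unique, so duplicates are violations.
    violations["student_id"] = int(df["student_id"].duplicated().sum())
    return violations


violations = find_range_violations(sample)
print(violations)
```

A count of zero for every column suggests the data matches its description; any non-zero count points to rows worth inspecting before training.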

#### Target Variables

This dataset allows us to approach the problem in two different ways:

**Regression Task:** Predict the student’s total_score (a continuous numeric value).

**Classification Task:** Predict the grade (a categorical label with five performance levels: A, B, C, D, or F).

We’ll later experiment with both approaches to compare model performance and interpretability.
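
One way to set up the two tasks is to share a single feature matrix and swap only the target. The sketch below is illustrative and assumes the column names from the dataset overview; `FEATURE_COLUMNS`, `y_regression`, and `y_classification` are hypothetical names, and `sample` again stands in for the full `data` DataFrame. Because `grade` is derived from `total_score`, the score is kept out of the features so that the classification task does not simply read the answer from an input column.

```python
import pandas as pd

# Stand-in sample: the first five rows of the dataset.
sample = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "weekly_self_study_hours": [18.5, 14.0, 19.5, 25.7, 13.4],
    "attendance_percentage": [95.6, 80.0, 86.3, 70.2, 81.9],
    "class_participation": [3.8, 2.5, 5.3, 7.0, 6.9],
    "total_score": [97.9, 83.9, 100.0, 100.0, 92.0],
    "grade": ["A", "B", "A", "A", "A"],
})

# Behavioral inputs only: student_id is an identifier, and total_score
# and grade are the targets, so none of them belongs in the features.
FEATURE_COLUMNS = [
    "weekly_self_study_hours",
    "attendance_percentage",
    "class_participation",
]

X = sample[FEATURE_COLUMNS]
y_regression = sample["total_score"]   # continuous target
y_classification = sample["grade"]     # categorical target

print(X.shape, y_regression.dtype, y_classification.unique())
```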

#### Initial Data Goals

At this stage, the main objectives are to:

**1.** Load and inspect the dataset to understand its structure and contents.

**2.** Check for missing or inconsistent data.

**3.** Identify data types for each column.

**4.** Generate basic summary statistics (mean, median, standard deviation).

**5.** Get an overview of value distributions to guide preprocessing and model design.

In [3]:
import pandas as pd

# Load dataset
data = pd.read_csv('student_performance.csv')

# Display basic info
print("Dataset shape:", data.shape)
display(data.head())

# Check data types and nulls
print("\nData Info:")
data.info()

Dataset shape: (1000000, 6)


|   | student_id | weekly_self_study_hours | attendance_percentage | class_participation | total_score | grade |
|---|---|---|---|---|---|---|
| 0 | 1 | 18.5 | 95.6 | 3.8 | 97.9 | A |
| 1 | 2 | 14.0 | 80.0 | 2.5 | 83.9 | B |
| 2 | 3 | 19.5 | 86.3 | 5.3 | 100.0 | A |
| 3 | 4 | 25.7 | 70.2 | 7.0 | 100.0 | A |
| 4 | 5 | 13.4 | 81.9 | 6.9 | 92.0 | A |



Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 6 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   student_id               1000000 non-null  int64  
 1   weekly_self_study_hours  1000000 non-null  float64
 2   attendance_percentage    1000000 non-null  float64
 3   class_participation      1000000 non-null  float64
 4   total_score              1000000 non-null  float64
 5   grade                    1000000 non-null  object 
dtypes: float64(4), int64(1), object(1)
memory usage: 45.8+ MB
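
The cell above covers loading, null checks, and data types. Goals 4 and 5 (summary statistics and value distributions) could be approached along the following lines. This is a sketch rather than the notebook's own code: `numeric_columns`, `summary`, and `grade_share` are hypothetical names, and the five-row `sample` frame stands in for the full `data` DataFrame.

```python
import pandas as pd

# Stand-in sample: the five rows shown in the head() output above.
sample = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "weekly_self_study_hours": [18.5, 14.0, 19.5, 25.7, 13.4],
    "attendance_percentage": [95.6, 80.0, 86.3, 70.2, 81.9],
    "class_participation": [3.8, 2.5, 5.3, 7.0, 6.9],
    "total_score": [97.9, 83.9, 100.0, 100.0, 92.0],
    "grade": ["A", "B", "A", "A", "A"],
})

numeric_columns = [
    "weekly_self_study_hours",
    "attendance_percentage",
    "class_participation",
    "total_score",
]

# Goal 4: mean, median, and standard deviation per numeric column.
summary = sample[numeric_columns].agg(["mean", "median", "std"])
print(summary.round(2))

# Goal 5: share of each letter grade, a quick check for class imbalance.
grade_share = sample["grade"].value_counts(normalize=True).sort_index()
print(grade_share)
```

On the full dataset, a strongly skewed `grade_share` would be a reason to prefer metrics other than plain accuracy for the classification task.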


## Data Preparation
#### Objective

The goal of this step is to clean, organize, and prepare the dataset for modeling.
Although the dataset is synthetic and relatively clean, we must still ensure the data is consistent, correctly formatted, and ready for both regression and classification models.

#### Key Preparation Steps

**Check and handle missing values** – Ensure there are no null or empty records.

**Convert data types** – Verify all numeric columns are in the correct format.

**Encode categorical variables** – Convert letter grades (A–F) into numeric form for modeling.

**Feature scaling (optional)** – Normalize or standardize numeric columns for certain algorithms.

**Split the dataset** – Separate data into training and testing sets for model evaluation.

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load dataset
data = pd.read_csv('student_performance.csv')

# 1. Check for missing values
print("Missing values per column:\n", data.isnull().sum())

# 2. Confirm data types
print("\nData types:\n", data.dtypes)

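# 3. Encode the 'grade' column for classification
# (LabelEncoder assigns codes in sorted label order, so A=0 ... F=4: a lower code means a better grade)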
label_encoder = LabelEncoder()
data['grade_encoded'] = label_encoder.fit_transform(data['grade'])

print("\nUnique grades:", data['grade'].unique())
print("Encoded mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

# 4. Check basic statistics
print("\nSummary statistics:\n")
display(data.describe())

# 5. Display sample of cleaned data
display(data.head())


Missing values per column:
 student_id                 0
weekly_self_study_hours    0
attendance_percentage      0
class_participation        0
total_score                0
grade                      0
dtype: int64

Data types:
 student_id                   int64
weekly_self_study_hours    float64
attendance_percentage      float64
class_participation        float64
total_score                float64
grade                       object
dtype: object

Unique grades: ['A' 'B' 'C' 'D' 'F']
Encoded mapping: {'A': np.int64(0), 'B': np.int64(1), 'C': np.int64(2), 'D': np.int64(3), 'F': np.int64(4)}

Summary statistics:



|   | student_id | weekly_self_study_hours | attendance_percentage | class_participation | total_score | grade_encoded |
|---|---|---|---|---|---|---|
| count | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 | 1000000.0 |
| mean | 500000.5 | 15.029127 | 84.711046 | 5.985203 | 84.283845 | 0.701944 |
| std | 288675.278932 | 6.899431 | 9.424143 | 1.956421 | 15.432969 | 0.915213 |
| min | 1.0 | 0.0 | 50.0 | 0.0 | 9.4 | 0.0 |
| 25% | 250000.75 | 10.3 | 78.3 | 4.7 | 73.9 | 0.0 |
| 50% | 500000.5 | 15.0 | 85.0 | 6.0 | 87.5 | 0.0 |
| 75% | 750000.25 | 19.7 | 91.8 | 7.3 | 100.0 | 1.0 |
| max | 1000000.0 | 40.0 | 100.0 | 10.0 | 100.0 | 4.0 |


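|   | student_id | weekly_self_study_hours | attendance_percentage | class_participation | total_score | grade | grade_encoded |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 18.5 | 95.6 | 3.8 | 97.9 | A | 0 |
| 1 | 2 | 14.0 | 80.0 | 2.5 | 83.9 | B | 1 |
| 2 | 3 | 19.5 | 86.3 | 5.3 | 100.0 | A | 0 |
| 3 | 4 | 25.7 | 70.2 | 7.0 | 100.0 | A | 0 |
| 4 | 5 | 13.4 | 81.9 | 6.9 | 92.0 | A | 0 |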
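
The cell above handles missing values, data types, and encoding; the remaining two preparation steps (feature scaling and the train/test split) might look like the following. This is a minimal sketch under stated assumptions, not the notebook's own code: it uses scikit-learn's `train_test_split` and `StandardScaler`, the regression target `total_score`, a hypothetical `FEATURE_COLUMNS` list, and the five sample rows as a stand-in for the full `data` DataFrame. The `test_size=0.4` value is chosen only so the tiny sample splits cleanly; a smaller test fraction would be typical for one million rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-in sample: the five rows shown in the head() output above.
sample = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "weekly_self_study_hours": [18.5, 14.0, 19.5, 25.7, 13.4],
    "attendance_percentage": [95.6, 80.0, 86.3, 70.2, 81.9],
    "class_participation": [3.8, 2.5, 5.3, 7.0, 6.9],
    "total_score": [97.9, 83.9, 100.0, 100.0, 92.0],
    "grade": ["A", "B", "A", "A", "A"],
})

FEATURE_COLUMNS = [
    "weekly_self_study_hours",
    "attendance_percentage",
    "class_participation",
]
X = sample[FEATURE_COLUMNS]
y = sample["total_score"]

# Split first so the test rows never influence the scaler.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=42
)

# Fit the scaler on the training rows only, then reuse it on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.shape, X_test_scaled.shape)
```

Splitting before scaling keeps test-set statistics out of the fitted scaler. For the classification task on the full dataset, passing `stratify=` with the grade column to `train_test_split` is one option for keeping grade proportions similar in both sets.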
