## Why Data Splitting Matters

In supervised machine learning, the goal is to build a model that learns how to correctly connect inputs to outputs. The inputs are often called **features** or **predictors**, while the outputs are known as **targets** or **responses**.

How you measure a model’s performance depends on the type of problem you’re solving.
- For **regression** tasks, performance is usually measured using metrics like **R²**, **root mean square error (RMSE)**, or **mean absolute error (MAE)**.  
- For **classification** problems, common metrics include **accuracy**, **precision**, **recall**, and the **F1 score**.
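
As a rough illustration, the metrics listed above can be computed with `sklearn.metrics`. The arrays below are small made-up values chosen only to show the calls; they are not taken from the dataset used later.

```python
import numpy as np
from sklearn.metrics import (
    r2_score, mean_squared_error, mean_absolute_error,
    accuracy_score, precision_score, recall_score, f1_score,
)

# Hypothetical regression targets and predictions (illustration only)
y_true_reg = np.array([3.0, 5.0, 2.5, 7.0])
y_pred_reg = np.array([2.5, 5.0, 3.0, 8.0])

r2 = r2_score(y_true_reg, y_pred_reg)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))  # RMSE = sqrt(MSE)
mae = mean_absolute_error(y_true_reg, y_pred_reg)
print(f"R2 = {r2:.3f}, RMSE = {rmse:.3f}, MAE = {mae:.3f}")

# Hypothetical binary labels and predictions (illustration only)
y_true_clf = np.array([1, 0, 1, 1, 0, 1])
y_pred_clf = np.array([1, 0, 0, 1, 1, 1])

accuracy = accuracy_score(y_true_clf, y_pred_clf)
precision = precision_score(y_true_clf, y_pred_clf)
recall = recall_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)
print(f"accuracy = {accuracy:.3f}, precision = {precision:.3f}, "
      f"recall = {recall:.3f}, F1 = {f1:.3f}")
```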

There’s no single *perfect* value for these metrics—what’s considered good performance can vary widely depending on the industry or use case. While many resources explain these metrics in detail, the most important thing to remember is this:

> **A model must be evaluated fairly to be trusted.**

You can’t accurately judge a model’s performance using the same data it was trained on, because the model has already *seen* that data. This would result in overly optimistic performance scores. Instead, the model should be tested on **new, unseen data** to understand how well it will perform in real-world situations.

That’s why we split our dataset before training. One part is used to **train** the model, and another part is kept aside to **test** it. This separation ensures that performance metrics reflect the model’s true predictive ability, not just its ability to memorize the training data.
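
A minimal sketch of this idea, assuming synthetic data from `make_classification` as a stand-in for a real dataset (the names `X_demo` and `y_demo` are illustrative). It fits a random forest on the training portion and compares accuracy on seen versus held-out rows.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption); flip_y adds label noise
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, flip_y=0.1, random_state=42
)

# Hold out 20% of the rows as a test set the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # accuracy on data the model has seen
test_acc = model.score(X_test, y_test)     # accuracy on held-out data

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

On noisy data like this, the training accuracy is typically much higher than the test accuracy, which is why only the test figure is a fair estimate.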


## Training, Validation, and Test Sets

Splitting your dataset is a key step in making sure your model’s performance is evaluated fairly and realistically. In many machine learning projects, the dataset is divided, usually at random, into **three parts**:

### Training Set
The **training set** is used to teach the model. This is where the model learns patterns in the data by adjusting its internal parameters.  
For example, during training, a model learns the best weights or coefficients in algorithms like **linear regression**, **logistic regression**, or **neural networks**.

### Validation Set
The **validation set** is used to compare candidate versions of the model during development, most notably during **hyperparameter tuning**.  
For instance, if you’re deciding how many neurons to use in a neural network or which kernel works best for a support vector machine, you try different options. For each option, the model is trained on the training set and evaluated using the validation set to see which configuration performs best.

### Test Set
The **test set** is reserved for the final evaluation of the model. It provides an unbiased measure of how well the model performs on completely new data.  
This dataset should **not** be used during training or validation, as doing so would compromise the fairness of the evaluation.

In simpler scenarios—where hyperparameter tuning isn’t needed—it’s often acceptable to work with just **training** and **test** sets.
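
One common way to obtain all three sets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 ratio, synthetic data, and a decision tree whose `max_depth` is the hyperparameter being tuned; all of these are illustrative choices rather than fixed rules.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, flip_y=0.1, random_state=42
)

# First split: set aside 20% of all rows as the final test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
# Second split: 25% of the remaining 80% becomes the validation set (20% overall)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42
)

# Tune max_depth using the validation set only
best_depth, best_val_acc = None, -1.0
for depth in [1, 3, 5, 10]:
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)
    val_acc = candidate.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

# Touch the test set once, after the configuration has been chosen
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
test_acc = final_model.score(X_test, y_test)

print(f"Sizes: train={len(X_train)}, val={len(X_val)}, test={len(X_test)}")
print(f"Best max_depth={best_depth}, validation accuracy={best_val_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
```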

---

## Underfitting and Overfitting

Splitting data also helps identify two common modeling problems, **underfitting** and **overfitting**, because you can compare performance on the training set with performance on held-out data.

### Underfitting
**Underfitting** occurs when a model is too simple to capture the underlying patterns in the data.  
For example, using a linear model to describe a clearly nonlinear relationship can lead to underfitting. These models usually perform poorly on both the training and test datasets because they fail to learn meaningful patterns.

### Overfitting
**Overfitting** happens when a model is too complex and learns not only the true patterns in the data but also the noise.  
Such models often perform extremely well on the training data but fail to generalize to new, unseen data. As a result, their performance on the test set is usually much worse.
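
Both failure modes can be seen by comparing training and test scores. The sketch below assumes synthetic data following `y = x^2` plus noise and three illustrative models: a linear regression (too simple), a depth-limited tree, and an unrestricted tree (too complex).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic nonlinear data (an assumption for illustration): y = x^2 + noise
x = rng.uniform(-3, 3, size=300)
y_noisy = x**2 + rng.normal(0, 1, size=300)
X_demo = x.reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_noisy, test_size=0.25, random_state=42
)

models = {
    "linear": LinearRegression(),                                 # too simple
    "tree_depth_4": DecisionTreeRegressor(max_depth=4, random_state=42),
    "tree_unlimited": DecisionTreeRegressor(random_state=42),     # too complex
}

# Store (train R2, test R2) for each model
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:15s} train R2 = {scores[name][0]:.2f}, "
          f"test R2 = {scores[name][1]:.2f}")
```

The linear model should score poorly on both sets (underfitting), while the unrestricted tree should fit the training data almost perfectly yet lose accuracy on the test set (overfitting).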

---

Splitting your dataset properly helps you balance learning and generalization, ensuring that your model performs well not just on known data, but also in real-world scenarios.


In [9]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import (
    train_test_split,
    KFold,
    StratifiedKFold,
    RepeatedKFold,
    LeaveOneOut,
    ShuffleSplit,
    StratifiedShuffleSplit,
    TimeSeriesSplit,
    GroupKFold,
    StratifiedGroupKFold,
    PredefinedSplit,
    cross_val_score
)
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)



In [10]:


df = pd.read_csv('student_performance_updated_1000.csv')

In [11]:
df.shape

(1000, 12)

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   StudentID                  960 non-null    float64
 1   Name                       966 non-null    object 
 2   Gender                     952 non-null    object 
 3   AttendanceRate             960 non-null    float64
 4   StudyHoursPerWeek          950 non-null    float64
 5   PreviousGrade              967 non-null    float64
 6   ExtracurricularActivities  957 non-null    float64
 7   ParentalSupport            978 non-null    object 
 8   FinalGrade                 960 non-null    float64
 9   Study Hours                976 non-null    float64
 10  Attendance (%)             959 non-null    float64
 11  Online Classes Taken       975 non-null    object 
dtypes: float64(8), object(4)
memory usage: 93.9+ KB


In [13]:
df.head()

|   | StudentID | Name | Gender | AttendanceRate | StudyHoursPerWeek | PreviousGrade | ExtracurricularActivities | ParentalSupport | FinalGrade | Study Hours | Attendance (%) | Online Classes Taken |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | John | Male | 85.0 | 15.0 | 78.0 | 1.0 | High | 80.0 | 4.8 | 59.0 | False |
| 1 | 2.0 | Sarah | Female | 90.0 | 20.0 | 85.0 | 2.0 | Medium | 87.0 | 2.2 | 70.0 | True |
| 2 | 3.0 | Alex | Male | 78.0 | 10.0 | 65.0 | 0.0 | Low | 68.0 | 4.6 | 92.0 | False |
| 3 | 4.0 | Michael | Male | 92.0 | 25.0 | 90.0 | 3.0 | High | 92.0 | 2.9 | 96.0 | False |
| 4 | 5.0 | Emma | Female | NaN | 18.0 | 82.0 | 2.0 | Medium | 85.0 | 4.1 | 97.0 | True |


In [14]:
df.isnull().sum()

StudentID                    40
Name                         34
Gender                       48
AttendanceRate               40
StudyHoursPerWeek            50
PreviousGrade                33
ExtracurricularActivities    43
ParentalSupport              22
FinalGrade                   40
Study Hours                  24
Attendance (%)               41
Online Classes Taken         25
dtype: int64

In [15]:
# Prepare data for machine learning

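# Drop rows with a missing target; .copy() makes df_clean an independent
# DataFrame, so adding a column below does not trigger SettingWithCopyWarning
df_clean = df.dropna(subset=['FinalGrade']).copy()

# Create binary target: Pass (1) if FinalGrade >= 70, else Fail (0)
df_clean['Pass'] = (df_clean['FinalGrade'] >= 70).astype(int)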

# Select features
feature_cols = ['AttendanceRate', 'StudyHoursPerWeek', 'PreviousGrade', 
                'ExtracurricularActivities', 'Study Hours', 'Attendance (%)']

# Clean data
df_clean = df_clean.dropna(subset=feature_cols)

# Prepare X and y
X = df_clean[feature_cols].values
y = df_clean['Pass'].values

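# Add Gender as an integer-coded feature; missing values become 'Unknown'.
# Note: integer codes imply an ordering, which tree-based models tolerate.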
if 'Gender' in df_clean.columns:
    le = LabelEncoder()
    gender_encoded = le.fit_transform(df_clean['Gender'].fillna('Unknown'))
    X = np.column_stack([X, gender_encoded])



In [21]:
print(f"Features shape: {X.shape}")


Features shape: (767, 7)


In [22]:
print(f"Target shape: {y.shape}")


Target shape: (767,)


In [23]:
print("Class Distribution:")
print(f"  Pass (1): {sum(y)} samples ({sum(y)/len(y)*100:.2f}%)")
print(f"  Fail (0): {len(y)-sum(y)} samples ({(len(y)-sum(y))/len(y)*100:.2f}%)")

Class Distribution:
  Pass (1): 615 samples (80.18%)
  Fail (0): 152 samples (19.82%)
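
With roughly 80% of samples in the pass class, a purely random split can shift the class ratio between subsets. One common remedy is the `stratify` argument of `train_test_split`. The sketch below uses stand-in labels with the same class counts as above and placeholder features (`X_demo`, `y_demo` are illustrative names, not the notebook's `X` and `y`).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as above: 615 pass, 152 fail
y_demo = np.array([1] * 615 + [0] * 152)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder features

# stratify=y_demo keeps the pass/fail ratio nearly identical in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

print(f"Overall pass rate: {y_demo.mean():.4f}")
print(f"Train pass rate:   {y_train.mean():.4f}")
print(f"Test pass rate:    {y_test.mean():.4f}")
```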
