# Predicting Student Test Scores
This Kaggle Playground Series hackathon focuses on building machine learning models to predict student test scores using a structured tabular dataset. The training and test data were synthetically generated from a deep learning model trained on the original Exam Score Prediction dataset. While the feature distributions closely resemble the original data, they are not identical, which introduces realistic noise and variation. Participants are encouraged to explore both the provided synthetic data and the original dataset to understand distributional differences and assess whether combining both sources improves predictive performance. The competition emphasizes experimentation, feature engineering, and model optimization in a controlled, learning-oriented environment.

## Goal
The primary goal of this hackathon is to accurately predict students’ test scores based on the available features and achieve the best possible performance on the leaderboard evaluation metric.

## Objectives
The specific objectives of the competition are:

- To understand and analyze tabular educational data and its underlying feature distributions.
- To apply data preprocessing, feature engineering, and exploratory data analysis techniques effectively.
- To develop, train, and evaluate regression models capable of predicting student test scores with high accuracy.
- To compare model performance when trained solely on the synthetic dataset versus a combination of synthetic and original datasets.
- To encourage iterative experimentation and practical machine learning skills development within the Kaggle Playground Series framework.
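
The synthetic-versus-combined comparison mentioned above needs the two sources stacked into one frame. The following is a minimal sketch, assuming the original dataset shares column names with the synthetic one; `combine_sources`, the `is_original` flag, and the tiny demo frames are hypothetical names and made-up values used only for illustration.

```python
import pandas as pd


def combine_sources(synthetic: pd.DataFrame, original: pd.DataFrame) -> pd.DataFrame:
    """Stack synthetic and original rows, keeping only the columns both share."""
    shared = [col for col in synthetic.columns if col in original.columns]
    return pd.concat(
        [
            synthetic[shared].assign(is_original=0),  # flag marks the data source
            original[shared].assign(is_original=1),
        ],
        ignore_index=True,
    )


# Made-up demo frames; the real files would be loaded with pd.read_csv.
synthetic_demo = pd.DataFrame(
    {"id": [0, 1], "study_hours": [7.91, 4.95], "exam_score": [78.3, 46.7]}
)
original_demo = pd.DataFrame(
    {"study_hours": [3.0], "exam_score": [55.0], "extra": ["x"]}
)
combined_demo = combine_sources(synthetic_demo, original_demo)
print(combined_demo)
```

Keeping the source flag makes it possible to train once on the synthetic rows only and once on the combined frame, then compare validation scores measured on synthetic rows alone.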

# Exploratory Data Analysis

In [1]:
# Library imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Load the training and test sets
df_train = pd.read_csv("../data/raw/train.csv")
df_test = pd.read_csv("../data/raw/test.csv")


In [5]:
df_train.head()

index,id,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score
0,0,21,female,b.sc,7.91,98.8,no,4.9,average,online videos,low,easy,78.3
1,1,18,other,diploma,4.95,94.8,yes,4.7,poor,self-study,medium,moderate,46.7
2,2,20,female,b.sc,4.68,92.6,yes,5.8,poor,coaching,high,moderate,99.0
3,3,19,male,b.sc,2.0,49.5,yes,8.3,average,group study,high,moderate,63.9
4,4,23,male,bca,7.65,86.9,yes,9.6,good,self-study,high,easy,100.0


In [6]:
df_test.head()

index,id,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty
0,630000,24,other,ba,6.85,65.2,yes,5.2,poor,group study,high,easy
1,630001,18,male,diploma,6.61,45.0,no,9.3,poor,coaching,low,easy
2,630002,24,female,b.tech,6.6,98.5,yes,6.2,good,group study,medium,moderate
3,630003,24,male,diploma,3.03,66.3,yes,5.7,average,mixed,medium,moderate
4,630004,20,female,b.tech,2.03,42.4,yes,9.2,average,coaching,low,moderate


In [8]:
print(df_test.duplicated().sum())
print(df_train.duplicated().sum())

0
0
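
Because `id` appears to be unique per row, a full-row `duplicated()` check cannot flag two students with identical feature values. A minimal sketch of a check that ignores `id` follows; `count_feature_duplicates` and the demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd


def count_feature_duplicates(df: pd.DataFrame, ignore=("id",)) -> int:
    """Count rows that repeat an earlier row once the ignored columns are dropped."""
    cols = [col for col in df.columns if col not in ignore]
    return int(df.duplicated(subset=cols).sum())


# Made-up rows: the first two differ only in id.
demo = pd.DataFrame(
    {"id": [0, 1, 2], "age": [21, 21, 18], "study_hours": [7.91, 7.91, 4.95]}
)
full_row_dupes = int(demo.duplicated().sum())
feature_dupes = count_feature_duplicates(demo)
print(full_row_dupes, feature_dupes)
```

On the real data this would be called as `count_feature_duplicates(df_train)`; a non-zero count is not necessarily an error in synthetic data, but it is worth knowing before splitting into folds.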


In [9]:
print(df_train.isnull().sum())
print(df_test.isnull().sum())

id                  0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
exam_score          0
dtype: int64
id                  0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
dtype: int64


In [10]:
df_train.describe()

statistic,id,age,study_hours,class_attendance,sleep_hours,exam_score
count,630000.0,630000.0,630000.0,630000.0,630000.0,630000.0
mean,314999.5,20.545821,4.002337,71.987261,7.072758,62.506672
std,181865.479132,2.260238,2.35988,17.430098,1.744811,18.916884
min,0.0,17.0,0.08,40.6,4.1,19.599
25%,157499.75,19.0,1.97,57.0,5.6,48.8
50%,314999.5,21.0,4.0,72.6,7.1,62.6
75%,472499.25,23.0,6.05,87.2,8.6,76.3
max,629999.0,24.0,7.91,99.4,9.9,100.0


In [11]:
df_test.describe()

statistic,id,age,study_hours,class_attendance,sleep_hours
count,270000.0,270000.0,270000.0,270000.0,270000.0
mean,764999.5,20.544137,4.003878,71.982509,7.07207
std,77942.430678,2.260452,2.357741,17.414695,1.745513
min,630000.0,17.0,0.08,40.6,4.1
25%,697499.75,19.0,1.98,57.0,5.6
50%,764999.5,21.0,4.01,72.6,7.1
75%,832499.25,23.0,6.05,87.2,8.6
max,899999.0,24.0,7.91,99.4,9.9


## Dataset Insights: Predicting Student Test Scores

### 1. Dataset Size and Structure
- **Training set:** 630,000 rows and 13 columns: `id`, 11 input features (4 numeric, 7 categorical), and the target `exam_score`.
- **Test set:** 270,000 rows with `id` and the same 11 input features.
- The large sample size gives stable estimates and lowers the overfitting risk for flexible models, although held-out validation is still needed.

### 2. Train–Test Distribution Consistency
- For the numeric features, means, standard deviations, quartiles, and ranges are nearly identical across the train and test sets.
- This suggests **no significant distribution shift** in the numeric features, so models trained on the training data should generalize well to the test data. The categorical features are not covered by `describe()` and should be compared separately.
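
The consistency check above can be made more explicit than eyeballing two `describe()` tables. The sketch below, with hypothetical helper names and made-up demo rows, computes a standardized mean difference for a numeric column and the largest frequency gap for a categorical column.

```python
import pandas as pd


def standardized_mean_diff(train: pd.DataFrame, test: pd.DataFrame, column: str) -> float:
    """(test mean - train mean) divided by the train standard deviation."""
    return float((test[column].mean() - train[column].mean()) / train[column].std())


def max_category_share_gap(train: pd.DataFrame, test: pd.DataFrame, column: str) -> float:
    """Largest absolute gap in category frequency between the two frames."""
    train_share = train[column].value_counts(normalize=True)
    test_share = test[column].value_counts(normalize=True)
    # fill_value=0.0 handles levels that appear in only one of the frames
    return float(test_share.subtract(train_share, fill_value=0.0).abs().max())


# Made-up demo rows, not taken from the competition files.
train_demo = pd.DataFrame(
    {"study_hours": [2.0, 4.0, 6.0, 4.0], "gender": ["female", "male", "male", "other"]}
)
test_demo = pd.DataFrame(
    {"study_hours": [3.0, 5.0, 7.0, 5.0], "gender": ["female", "female", "male", "other"]}
)
smd = standardized_mean_diff(train_demo, test_demo, "study_hours")
share_gap = max_category_share_gap(train_demo, test_demo, "gender")
print(smd, share_gap)
```

Values close to zero for both measures on `df_train` and `df_test` would support the claim of no meaningful shift; what counts as "close" is a judgment call rather than a fixed threshold.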

### 3. Feature-Level Insights

#### `id`
- Acts only as a unique identifier.
- Train ids run from 0 to 629,999 and test ids from 630,000 to 899,999; as a sequential row label it is expected to carry **no predictive information**.
- Should be excluded from modeling.

#### `age`
- Range: **17–24 years**, mean ≈ **20.5**.
- Low variance and narrow range suggest limited standalone predictive power.
- May still contribute through interactions with behavioral features.

#### `study_hours`
- Range: **0.08–7.91 hours**, mean ≈ **4.0**.
- Standard deviation ≈ **2.36**, which is large relative to the mean.
- Likely one of the **strongest predictors** of exam performance.

#### `class_attendance`
- Range: **40.6%–99.4%**, mean ≈ **72%**.
- Median (**72.6%**) sits close to the mean, with a standard deviation of ≈ **17.4** percentage points.
- Expected to have a **strong positive correlation** with exam scores.

#### `sleep_hours`
- Range: **4.1–9.9 hours**, mean ≈ **7.07**.
- Moderate spread (standard deviation ≈ **1.74**), with the middle 50% of values between 5.6 and 8.6 hours.
- Relationship with exam score may be **nonlinear** (too little or too much sleep may reduce performance).
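
One way to let a linear model pick up the possible nonlinear effect of sleep is to add distance-from-a-reference features. This is a sketch only: `add_sleep_features` is a hypothetical helper, and the 7.0-hour reference is an assumption chosen because it is close to the observed mean, not a value taken from the competition.

```python
import pandas as pd


def add_sleep_features(df: pd.DataFrame, reference_hours: float = 7.0) -> pd.DataFrame:
    """Add absolute and squared deviation of sleep_hours from a reference value."""
    out = df.copy()
    out["sleep_dev"] = (out["sleep_hours"] - reference_hours).abs()
    out["sleep_dev_sq"] = out["sleep_dev"] ** 2
    return out


# sleep_hours values of the kind shown in the head() output above
sleep_demo = pd.DataFrame({"sleep_hours": [4.9, 7.0, 9.6]})
sleep_features = add_sleep_features(sleep_demo)
print(sleep_features)
```

Tree-based models can learn such a shape from `sleep_hours` directly, so these columns are most likely to help linear baselines.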

### 4. Target Variable (`exam_score`)
- Range: **19.6–100**, mean ≈ **62.5**, median ≈ **62.6**.
- Standard deviation ≈ **18.9** with an interquartile range of 48.8–76.3, indicating diverse performance levels.
- Mean and median alignment suggests a **roughly symmetric distribution**, suitable for RMSE/MAE optimization.
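
A constant-prediction baseline gives a floor that any model should beat. The sketch below uses the five `exam_score` values from the `head()` output purely as demo data; `rmse` and `mae` are small helpers written here, not library functions. Among constant predictions the mean minimizes RMSE and the median minimizes MAE, and the RMSE of the mean baseline equals the population standard deviation, so on the full training set it should land near the reported 18.9.

```python
import numpy as np


def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


# Five exam_score values from the head() output, used only as demo data.
y = np.array([78.3, 46.7, 99.0, 63.9, 100.0])
mean_baseline = np.full_like(y, y.mean())
median_baseline = np.full_like(y, np.median(y))

print("RMSE, mean baseline:  ", rmse(y, mean_baseline))
print("MAE,  median baseline:", mae(y, median_baseline))
```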

### 5. Modeling Implications
- Random train–validation splits are appropriate due to stable distributions.
- Feature scaling is useful for linear and distance-based models, and the seven categorical columns need encoding before most models can use them.
- Tree-based and boosting models (e.g., Random Forest, XGBoost, LightGBM) are well-suited due to:
  - Nonlinear relationships
  - Feature interactions
- Behavioral features (`study_hours`, `class_attendance`, `sleep_hours`) are expected to dominate model importance; this is a hypothesis to verify against fitted models.
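
The points above can be sketched as an ordinal encoding step followed by a random train–validation split. The level orderings are assumptions based only on the values visible in the `head()` output; any level not listed is encoded as -1, and `exam_difficulty` is left out because its full set of levels is not shown. `encode_ordinals` and `random_split` are hypothetical helper names.

```python
import numpy as np
import pandas as pd

# Assumed orderings, built from the levels visible in head(); the full data may hold more.
ORDINAL_LEVELS = {
    "sleep_quality": ["poor", "average", "good"],
    "facility_rating": ["low", "medium", "high"],
}


def encode_ordinals(df: pd.DataFrame, levels=ORDINAL_LEVELS) -> pd.DataFrame:
    """Replace each listed column with integer codes; unseen levels become -1."""
    out = df.copy()
    for col, order in levels.items():
        out[col] = pd.Categorical(out[col], categories=order, ordered=True).codes
    return out


def random_split(df: pd.DataFrame, valid_fraction: float = 0.2, seed: int = 42):
    """Shuffle row positions and return (train_part, valid_part)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(df))
    n_valid = int(round(len(df) * valid_fraction))
    return df.iloc[idx[n_valid:]], df.iloc[idx[:n_valid]]


# Values copied from the first five training rows shown earlier.
demo = pd.DataFrame(
    {
        "sleep_quality": ["average", "poor", "poor", "average", "good"],
        "facility_rating": ["low", "medium", "high", "high", "high"],
        "exam_score": [78.3, 46.7, 99.0, 63.9, 100.0],
    }
)
encoded = encode_ordinals(demo)
train_part, valid_part = random_split(encoded)
print(encoded)
```

Nominal columns such as `course` or `study_method` have no natural order and would need a different treatment, for example one-hot encoding or a library's native categorical support.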

### 6. Overall Insight
- The dataset is clean (no missing values and no duplicate rows) and is intentionally designed for regression experimentation.
- Success in this competition depends more on **feature interactions and model choice** than on heavy data cleaning.
- The problem is ideal for benchmarking regression techniques on large-scale tabular data.


In [12]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                630000 non-null  int64  
 1   age               630000 non-null  int64  
 2   gender            630000 non-null  object 
 3   course            630000 non-null  object 
 4   study_hours       630000 non-null  float64
 5   class_attendance  630000 non-null  float64
 6   internet_access   630000 non-null  object 
 7   sleep_hours       630000 non-null  float64
 8   sleep_quality     630000 non-null  object 
 9   study_method      630000 non-null  object 
 10  facility_rating   630000 non-null  object 
 11  exam_difficulty   630000 non-null  object 
 12  exam_score        630000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 62.5+ MB


In [13]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 270000 entries, 0 to 269999
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                270000 non-null  int64  
 1   age               270000 non-null  int64  
 2   gender            270000 non-null  object 
 3   course            270000 non-null  object 
 4   study_hours       270000 non-null  float64
 5   class_attendance  270000 non-null  float64
 6   internet_access   270000 non-null  object 
 7   sleep_hours       270000 non-null  float64
 8   sleep_quality     270000 non-null  object 
 9   study_method      270000 non-null  object 
 10  facility_rating   270000 non-null  object 
 11  exam_difficulty   270000 non-null  object 
dtypes: float64(3), int64(2), object(7)
memory usage: 24.7+ MB
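
The `+` in the memory figures above means pandas reports a lower bound, because the contents of `object` columns are not measured. With seven low-cardinality text columns, converting them to the `category` dtype usually shrinks the frame considerably. A minimal sketch, with `to_category` as a hypothetical helper and a made-up demo frame:

```python
import pandas as pd


def to_category(df: pd.DataFrame) -> pd.DataFrame:
    """Cast every non-numeric column to the pandas category dtype."""
    out = df.copy()
    # In this dataset the non-numeric columns are exactly the text columns.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].astype("category")
    return out


# Made-up demo frame with repeated text values, mimicking the real columns.
demo = pd.DataFrame(
    {
        "age": [21, 18] * 5000,
        "gender": ["female", "other"] * 5000,
        "study_method": ["online videos", "self-study"] * 5000,
    }
)
compact = to_category(demo)
before = int(demo.memory_usage(deep=True).sum())
after = int(compact.memory_usage(deep=True).sum())
print(f"{before} bytes -> {after} bytes")
```

Passing `deep=True` to `memory_usage` (or `memory_usage="deep"` to `info`) reports the true footprint, which makes the before and after figures comparable.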
