# Predict Student Test Score (Kaggle Playground Series)

This notebook explores the impact of synthetic vs original data on model performance in a Kaggle Playground Series competition.

The competition dataset is synthetic: it was generated with deep learning models, and the competition links to the original real-world dataset it is based on.

Instead of focusing solely on leaderboard performance, the goal of this notebook is to compare how different data sources affect model behavior and evaluation metrics.

In this notebook, three different datasets are analyzed:

* Synthetic data (official competition dataset)

* Original real-world data

* A combined dataset created by merging synthetic and original data

Each dataset undergoes the same preprocessing steps and is evaluated using the same models to ensure a fair comparison.

The evaluation metric used throughout the notebook is RMSE, consistent with the competition metric.
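
As a reminder of what the metric measures, the following minimal sketch computes RMSE by hand with NumPy. The helper name `rmse` and the toy arrays are assumptions made for illustration; they are not competition data. The notebook itself relies on scikit-learn's scoring instead.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y_true - y_pred)^2))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical exam scores, for illustration only.
y_true = [50.0, 60.0, 70.0]
y_pred = [53.0, 56.0, 70.0]

# Errors are 3, 4 and 0, so the mean squared error is 25 / 3.
toy_rmse = rmse(y_true, y_pred)
print(toy_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.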

## Exploratory Data Analysis (EDA)

### Importing Libraries

In [77]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import root_mean_squared_error
import optuna
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
import mlflow
import mlflow.sklearn
from mlflow.models.signature import infer_signature


In [40]:
Synt_data = pd.read_csv("data/train.csv")
original_data = pd.read_csv("data/Exam_Score_Prediction.csv")
print(f"Synthetic (Competition) Data = {Synt_data.shape}")
print(f"Original Data = {original_data.shape}")

Synthetic (Competition) Data = (630000, 13)
Original Data = (20000, 13)


In [41]:
Synt_data.head()

Unnamed: 0,id,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score
0,0,21,female,b.sc,7.91,98.8,no,4.9,average,online videos,low,easy,78.3
1,1,18,other,diploma,4.95,94.8,yes,4.7,poor,self-study,medium,moderate,46.7
2,2,20,female,b.sc,4.68,92.6,yes,5.8,poor,coaching,high,moderate,99.0
3,3,19,male,b.sc,2.0,49.5,yes,8.3,average,group study,high,moderate,63.9
4,4,23,male,bca,7.65,86.9,yes,9.6,good,self-study,high,easy,100.0


In [42]:
original_data.head()

Unnamed: 0,student_id,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score
0,1,17,male,diploma,2.78,92.9,yes,7.4,poor,coaching,low,hard,58.9
1,2,23,other,bca,3.37,64.8,yes,4.6,average,online videos,medium,moderate,54.8
2,3,22,male,b.sc,7.88,76.8,yes,8.5,poor,coaching,high,moderate,90.3
3,4,20,other,diploma,0.67,48.4,yes,5.8,average,online videos,low,moderate,29.7
4,5,20,female,diploma,0.89,71.6,yes,9.8,poor,coaching,low,moderate,43.7


In [43]:
original_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   student_id        20000 non-null  int64  
 1   age               20000 non-null  int64  
 2   gender            20000 non-null  object 
 3   course            20000 non-null  object 
 4   study_hours       20000 non-null  float64
 5   class_attendance  20000 non-null  float64
 6   internet_access   20000 non-null  object 
 7   sleep_hours       20000 non-null  float64
 8   sleep_quality     20000 non-null  object 
 9   study_method      20000 non-null  object 
 10  facility_rating   20000 non-null  object 
 11  exam_difficulty   20000 non-null  object 
 12  exam_score        20000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 2.0+ MB


In [44]:
Synt_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 630000 entries, 0 to 629999
Data columns (total 13 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   id                630000 non-null  int64  
 1   age               630000 non-null  int64  
 2   gender            630000 non-null  object 
 3   course            630000 non-null  object 
 4   study_hours       630000 non-null  float64
 5   class_attendance  630000 non-null  float64
 6   internet_access   630000 non-null  object 
 7   sleep_hours       630000 non-null  float64
 8   sleep_quality     630000 non-null  object 
 9   study_method      630000 non-null  object 
 10  facility_rating   630000 non-null  object 
 11  exam_difficulty   630000 non-null  object 
 12  exam_score        630000 non-null  float64
dtypes: float64(4), int64(2), object(7)
memory usage: 62.5+ MB


## Data Preprocessing

### Handling Missing Values

Both datasets are checked for missing values before any modeling.

The same check is applied to the synthetic and the original data. Neither contains missing values, so no imputation is required.

In [None]:
print("*********** Synthetic Data Missing Values *************")
print(Synt_data.isnull().sum())

print("*********** Original Data Missing Values *************")
print(original_data.isnull().sum())

*********** Synthetic Data Missing Values *************
id                  0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
exam_score          0
dtype: int64
*********** Original Data Missing Values *************
student_id          0
age                 0
gender              0
course              0
study_hours         0
class_attendance    0
internet_access     0
sleep_hours         0
sleep_quality       0
study_method        0
facility_rating     0
exam_difficulty     0
exam_score          0
dtype: int64


### Outlier Detection & Handling

Outlier detection was performed to identify extreme values that could hurt model performance.

An IQR-style rule is applied to every numerical feature: a value counts as an outlier if it lies more than 1.5 times the inter-quantile range beyond the 5th or 95th percentile. These bounds are wider than the classic 25th/75th quartile fences.

Under this rule, no outliers were detected in either dataset.
Therefore, all observations were retained to preserve the original data distribution.

In [46]:
def outliers_thresholds(dataframe, variable, q1=0.05, q3=0.95):
    """Return (upper, lower) limits built from the q1/q3 inter-quantile range."""
    quartile1 = dataframe[variable].quantile(q1)
    quartile3 = dataframe[variable].quantile(q3)

    iqr = quartile3 - quartile1
    up_limit = quartile3 + 1.5 * iqr
    low_limit = quartile1 - 1.5 * iqr

    return up_limit, low_limit

def check_outliers(dataframe, variable):
    """Return True if any value in the column lies outside the limits."""
    up_limit, low_limit = outliers_thresholds(dataframe=dataframe, variable=variable)

    outside = (dataframe[variable] < low_limit) | (dataframe[variable] > up_limit)
    return bool(outside.any())
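
To illustrate how the choice of quantiles changes the verdict, here is a small sketch comparing the 5%/95% bounds used in this notebook with the textbook 25%/75% IQR rule. The helper `fences` and the toy series are hypothetical and exist only for this example.

```python
import pandas as pd

def fences(series, q_low, q_high, k=1.5):
    """Return (lower, upper) fences built from two quantiles."""
    lo, hi = series.quantile(q_low), series.quantile(q_high)
    spread = hi - lo
    return lo - k * spread, hi + k * spread

# Hypothetical data: the integers 1..100 plus one larger value.
values = pd.Series(list(range(1, 101)) + [200], dtype=float)

wide = fences(values, 0.05, 0.95)     # quantiles used in this notebook
classic = fences(values, 0.25, 0.75)  # textbook IQR rule

# Does any value exceed the upper fence under each rule?
flag_wide = bool((values > wide[1]).any())
flag_classic = bool((values > classic[1]).any())
print(wide, flag_wide)
print(classic, flag_classic)
```

In this toy case the value 200 is flagged by the quartile rule but not by the wider 5%/95% rule, so a `False` result under the wider rule means only that no value is very far from the bulk of the data.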

In [None]:
print("Check Outliers on Synthetic Data")
num_col = [col for col in Synt_data.columns if Synt_data[col].dtype != "O"]
for col in num_col:
    print(col, check_outliers(Synt_data, col))

print("******************************************")
print("Check Outliers on Original Data")
num_col = [col for col in original_data.columns if original_data[col].dtype != "O"]
for col in num_col:
    print(col, check_outliers(original_data, col))

Check Outliers on Synthetic Data
id False
age False
study_hours False
class_attendance False
sleep_hours False
exam_score False
******************************************
Check Outliers on Original Data
student_id False
age False
study_hours False
class_attendance False
sleep_hours False
exam_score False


### Feature Engineering

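Simple feature engineering is applied to improve model expressiveness while avoiding excessive complexity.

Three interaction features are created as pairwise products: `study_hours` × `class_attendance`, `age` × `study_hours`, and `age` × `class_attendance`. The focus is on interpretability rather than aggressive feature creation.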

#### Combined Data

In [106]:
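# errors="ignore" keeps this cell safe to re-run; a second run would otherwise
# raise KeyError: "['id'] not found in axis".
Synt_data = Synt_data.drop(columns="id", errors="ignore")
original_data = original_data.drop(columns="student_id", errors="ignore")
# ignore_index=True gives the stacked frame a unique 0..n-1 index.
combined_data = pd.concat([Synt_data, original_data], axis=0, ignore_index=True)
print(f"df = {combined_data.shape}")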

In [107]:
combined_data.head()

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,21,7.91,98.8,4.9,78.3,0.993939,0.874098,0.81668,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,18,4.95,94.8,4.7,46.7,0.595158,0.465514,0.599387,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,20,4.68,92.6,5.8,99.0,0.549319,0.489389,0.685266,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,19,2.0,49.5,8.3,63.9,0.122288,0.194397,0.147635,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,23,7.65,86.9,9.6,100.0,0.844868,0.926305,0.771794,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


#### Feature Engineering for Combined Data

In [51]:
combined_data['NEW_academic_engagement'] = combined_data['study_hours'] * combined_data['class_attendance'] 
combined_data['NEW_age_study_interaction'] = combined_data['age'] * combined_data['study_hours']
combined_data['NEW_age_class_interaction'] = combined_data['age'] * combined_data['class_attendance']
combined_data.head()

Unnamed: 0,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction
0,21,female,b.sc,7.91,98.8,no,4.9,average,online videos,low,easy,78.3,781.508,166.11,2074.8
1,18,other,diploma,4.95,94.8,yes,4.7,poor,self-study,medium,moderate,46.7,469.26,89.1,1706.4
2,20,female,b.sc,4.68,92.6,yes,5.8,poor,coaching,high,moderate,99.0,433.368,93.6,1852.0
3,19,male,b.sc,2.0,49.5,yes,8.3,average,group study,high,moderate,63.9,99.0,38.0,940.5
4,23,male,bca,7.65,86.9,yes,9.6,good,self-study,high,easy,100.0,664.785,175.95,1998.7


#### Feature Engineering for Synthetic and Original Data

In [52]:
original_data['NEW_academic_engagement'] = original_data['study_hours'] * original_data['class_attendance'] 
original_data['NEW_age_study_interaction'] = original_data['age'] * original_data['study_hours']
original_data['NEW_age_class_interaction'] = original_data['age'] * original_data['class_attendance']

Synt_data['NEW_academic_engagement'] = Synt_data['study_hours'] * Synt_data['class_attendance'] 
Synt_data['NEW_age_study_interaction'] = Synt_data['age'] * Synt_data['study_hours']
Synt_data['NEW_age_class_interaction'] = Synt_data['age'] * Synt_data['class_attendance']

original_data.head()

Unnamed: 0,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction
0,17,male,diploma,2.78,92.9,yes,7.4,poor,coaching,low,hard,58.9,258.262,47.26,1579.3
1,23,other,bca,3.37,64.8,yes,4.6,average,online videos,medium,moderate,54.8,218.376,77.51,1490.4
2,22,male,b.sc,7.88,76.8,yes,8.5,poor,coaching,high,moderate,90.3,605.184,173.36,1689.6
3,20,other,diploma,0.67,48.4,yes,5.8,average,online videos,low,moderate,29.7,32.428,13.4,968.0
4,20,female,diploma,0.89,71.6,yes,9.8,poor,coaching,low,moderate,43.7,63.724,17.8,1432.0


In [53]:
Synt_data.head()

Unnamed: 0,age,gender,course,study_hours,class_attendance,internet_access,sleep_hours,sleep_quality,study_method,facility_rating,exam_difficulty,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction
0,21,female,b.sc,7.91,98.8,no,4.9,average,online videos,low,easy,78.3,781.508,166.11,2074.8
1,18,other,diploma,4.95,94.8,yes,4.7,poor,self-study,medium,moderate,46.7,469.26,89.1,1706.4
2,20,female,b.sc,4.68,92.6,yes,5.8,poor,coaching,high,moderate,99.0,433.368,93.6,1852.0
3,19,male,b.sc,2.0,49.5,yes,8.3,average,group study,high,moderate,63.9,99.0,38.0,940.5
4,23,male,bca,7.65,86.9,yes,9.6,good,self-study,high,easy,100.0,664.785,175.95,1998.7


### Categorical Encoding 

In [78]:
ohe_encode = OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop="first")
def ohe_encoder(dataframe, cat_columns):
    encode_columns = ohe_encode.fit_transform(dataframe[cat_columns])
    encode_columns = pd.DataFrame(encode_columns, columns=ohe_encode.get_feature_names_out(cat_columns), index=dataframe.index)

    return encode_columns

#### Encoding for combined data

In [55]:
category_columns_combined = [col for col in combined_data.columns if combined_data[col].dtype == "O"]
encode_combined_cat = ohe_encoder(combined_data, category_columns_combined)
encode_combined_cat.head()

Unnamed: 0,gender_male,gender_other,course_b.sc,course_b.tech,course_ba,course_bba,course_bca,course_diploma,internet_access_yes,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


In [56]:
# Replace the categorical columns of the combined data with their one-hot encoded versions, keeping the numeric columns.
combined_data = combined_data.drop(category_columns_combined, axis=1)
combined_data = pd.concat([combined_data, encode_combined_cat], axis=1)
combined_data.head()

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,21,7.91,98.8,4.9,78.3,781.508,166.11,2074.8,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,18,4.95,94.8,4.7,46.7,469.26,89.1,1706.4,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,20,4.68,92.6,5.8,99.0,433.368,93.6,1852.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,19,2.0,49.5,8.3,63.9,99.0,38.0,940.5,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,23,7.65,86.9,9.6,100.0,664.785,175.95,1998.7,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


#### Encoding for Synthetic and Original Data

In [57]:
category_columns_original = [col for col in original_data.columns if original_data[col].dtype == "O"]
encode_original_cat = ohe_encoder(original_data, category_columns_original)

original_data = original_data.drop(category_columns_original, axis=1)
original_data = pd.concat([original_data, encode_original_cat], axis=1)
original_data.head()

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,17,2.78,92.9,7.4,58.9,258.262,47.26,1579.3,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,23,3.37,64.8,4.6,54.8,218.376,77.51,1490.4,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,22,7.88,76.8,8.5,90.3,605.184,173.36,1689.6,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,20,0.67,48.4,5.8,29.7,32.428,13.4,968.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,20,0.89,71.6,9.8,43.7,63.724,17.8,1432.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


In [58]:
category_columns_synt = [col for col in Synt_data.columns if Synt_data[col].dtype == "O"]
encode_synt_cat = ohe_encoder(Synt_data, category_columns_synt)

Synt_data = Synt_data.drop(category_columns_synt, axis=1)
Synt_data = pd.concat([Synt_data, encode_synt_cat], axis=1)
Synt_data.head()

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,21,7.91,98.8,4.9,78.3,781.508,166.11,2074.8,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
1,18,4.95,94.8,4.7,46.7,469.26,89.1,1706.4,0.0,1.0,...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,20,4.68,92.6,5.8,99.0,433.368,93.6,1852.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,19,2.0,49.5,8.3,63.9,99.0,38.0,940.5,1.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,23,7.65,86.9,9.6,100.0,664.785,175.95,1998.7,1.0,0.0,...,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0


### Feature Scaling

The three engineered interaction features, which span much larger ranges than the raw inputs, are rescaled to the [0, 1] range using min-max scaling.

Tree-based models are generally scale-invariant, but scaling is applied for consistency and potential future model extensions.

In [79]:
min_max_scaler = MinMaxScaler()
def scaler(dataframe, numeric_columns):
    dataframe[numeric_columns] = min_max_scaler.fit_transform(dataframe[numeric_columns])
    return dataframe
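
For reference, min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The following sketch uses made-up numbers to show that rule and what happens to values outside the fitted range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train_col = np.array([[10.0], [20.0], [30.0]])  # hypothetical fitted values
new_col = np.array([[15.0], [40.0]])            # hypothetical unseen values

mm = MinMaxScaler().fit(train_col)   # stores min=10 and max=30
scaled_new = mm.transform(new_col)   # applies (x - 10) / (30 - 10)

# The same computation written out by hand.
manual = (new_col - train_col.min()) / (train_col.max() - train_col.min())
print(scaled_new.ravel())
```

Values beyond the fitted range land outside [0, 1] (here 40 maps to 1.5), which is expected behavior and not an error.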

In [None]:
num_columns = ["NEW_academic_engagement", "NEW_age_study_interaction", "NEW_age_class_interaction"]

scaler(original_data, num_columns)
scaler(combined_data, num_columns)
scaler(Synt_data, num_columns)

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,exam_score,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,17,2.78,92.9,7.4,58.9,0.325686,0.243527,0.524419,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,23,3.37,64.8,4.6,54.8,0.274746,0.404022,0.471983,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0
2,22,7.88,76.8,8.5,90.3,0.76875,0.912564,0.589477,1.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,20,0.67,48.4,5.8,29.7,0.037267,0.063879,0.163855,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0
4,20,0.89,71.6,9.8,43.7,0.077236,0.087224,0.437537,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0


## Baseline Model Comparison

In this section, multiple machine learning models are compared across three different datasets:
synthetic data, original data, and a combined dataset.

The goal of this step is not hyperparameter optimization, but to understand how different data sources affect baseline model performance.

For each dataset, the following models are evaluated using the same cross-validation strategy:

* Decision Tree

* Random Forest

* LightGBM

* XGBoost

Model performance is compared using RMSE (mean and standard deviation) to ensure a fair and consistent evaluation.
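
The cross-validation scores come from scikit-learn's `"neg_root_mean_squared_error"` scorer, which returns negated RMSE so that larger is always better. A minimal sketch on generated toy data (an assumption, not the competition data) shows how the sign is flipped back before taking the mean and standard deviation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Toy regression problem generated for illustration only.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=15)

scores = cross_val_score(DecisionTreeRegressor(random_state=15),
                         X_toy, y_toy,
                         cv=5,
                         scoring="neg_root_mean_squared_error")

rmse_folds = -scores  # one positive RMSE value per fold
print(rmse_folds.mean(), rmse_folds.std())
```

The standard deviation across folds gives a rough sense of how much the mean RMSE could move with a different split.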

In [91]:
def choose_model(dataframe, target, random_state=15):
    """Cross-validate four baseline regressors and return their RMSE mean/std."""
    X = dataframe.drop(target, axis=1)
    y = dataframe[target]

    # Hold out 20% of the rows; cross-validation runs on the remaining 80% only.
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=random_state)

    models = {
        "Decision Tree Regressor": DecisionTreeRegressor(random_state=random_state),
        "Random Forest Regressor": RandomForestRegressor(n_jobs=-1, random_state=random_state),
        "LightGBM Regressor": LGBMRegressor(random_state=random_state),
        "XGBoost Regressor": XGBRegressor(eval_metric="rmse", objective="reg:squarederror",
                                          random_state=random_state),
    }

    results = []
    for model_name, model in models.items():
        scores = cross_val_score(model,
                                 X_train,
                                 y_train,
                                 cv=5,
                                 scoring="neg_root_mean_squared_error")

        # scikit-learn returns negated RMSE, so flip the sign back.
        rmse = -scores
        results.append({"Model": model_name,
                        "RMSE mean": rmse.mean(),
                        "RMSE std": rmse.std()})

    return pd.DataFrame(results)

In [92]:
datasets = {
    'Synthetic Data' : Synt_data,
    'Original Data' : original_data,
    'Combined Data' : combined_data
}

all_results = []

for name, data in datasets.items():
    temp = choose_model(dataframe=data, target='exam_score')
    temp['dataset'] = name
    all_results.append(temp)

final_results = pd.concat(all_results, ignore_index=False)

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004263 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1371
[LightGBM] [Info] Number of data points in the train set: 403200, number of used features: 26
[LightGBM] [Info] Start training from score 62.495085
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004384 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 1371
[LightGBM] [Info] Number of data points in the train set: 403200, number of used features: 26
[LightGBM] [Info] Start training from score 62.526890
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004929 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not e

In [93]:
final_results.sort_values(by='RMSE mean')

Unnamed: 0,Model,RMSE mean,RMSE std,dataset
2,LightGBM Regressor,8.830613,0.005553,Synthetic Data
3,XGBoost Regressor,8.835006,0.005749,Synthetic Data
2,LightGBM Regressor,8.856223,0.008956,Combined Data
3,XGBoost Regressor,8.856535,0.009659,Combined Data
1,Random Forest Regressor,9.175724,0.010029,Synthetic Data
1,Random Forest Regressor,9.190801,0.007637,Combined Data
2,LightGBM Regressor,10.012413,0.08871,Original Data
3,XGBoost Regressor,10.552949,0.117029,Original Data
1,Random Forest Regressor,10.64493,0.138495,Original Data
0,Decision Tree Regressor,12.816903,0.032838,Synthetic Data
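
The long table can be easier to read with one row per model and one column per dataset. The sketch below rebuilds the tree-ensemble rows shown above by hand and pivots them; the names `summary` and `by_dataset` are assumptions for illustration. Inside the notebook, the same `pivot` call could presumably be applied to `final_results` directly.

```python
import pandas as pd

# RMSE means copied from the results table above (ensemble models only).
rows = [
    ("LightGBM Regressor", "Synthetic Data", 8.830613),
    ("XGBoost Regressor", "Synthetic Data", 8.835006),
    ("LightGBM Regressor", "Combined Data", 8.856223),
    ("XGBoost Regressor", "Combined Data", 8.856535),
    ("Random Forest Regressor", "Synthetic Data", 9.175724),
    ("Random Forest Regressor", "Combined Data", 9.190801),
    ("LightGBM Regressor", "Original Data", 10.012413),
    ("XGBoost Regressor", "Original Data", 10.552949),
    ("Random Forest Regressor", "Original Data", 10.64493),
]
summary = pd.DataFrame(rows, columns=["Model", "dataset", "RMSE mean"])

# One row per model, one column per dataset.
by_dataset = summary.pivot(index="Model", columns="dataset", values="RMSE mean")
print(by_dataset.round(3))

# Dataset with the lowest RMSE for each model.
best_dataset = by_dataset.idxmin(axis=1)
print(best_dataset)
```

Each score is measured on held-out folds of its own dataset, so the columns are not evaluated on identical rows. The comparison is indicative, not a strict like-for-like test.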


## Hyperparameter Tuning

Hyperparameter optimization is performed only on the synthetic dataset.

This decision is intentional: in the baseline comparison, models trained on synthetic data achieved the lowest cross-validated RMSE (for LightGBM, 8.83 on synthetic data versus 8.86 on combined and 10.01 on original data).

Therefore, further optimization efforts are focused on the dataset that demonstrated the strongest baseline results.

LightGBM and XGBoost models are tuned separately, and all tuning experiments are tracked using MLflow.

### LightGBM

In [None]:
X = Synt_data.drop('exam_score', axis=1)
y = Synt_data['exam_score']

In [97]:
def objective_lgbm(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.1, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'num_leaves': trial.suggest_int('num_leaves',30, 255 ),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'metric' : 'rmse',
        'verbosity': -1,
        'random_state' : 15
        
    }

    model= LGBMRegressor(**params)

    scores = cross_val_score(model,
                             X,
                             y,
                             cv=3,
                             scoring="neg_root_mean_squared_error")
    
    rmse = - scores
    mean_rmse = rmse.mean()
    rmse_cv_std = rmse.std()

    trial.set_user_attr("rmse_std", rmse_cv_std)
    

    return mean_rmse


study_lgbm = optuna.create_study(direction='minimize')
optuna.logging.set_verbosity(optuna.logging.INFO)
study_lgbm.optimize(objective_lgbm, n_trials=30)


[I 2026-01-28 21:41:45,537] A new study created in memory with name: no-name-cb41332a-1181-4866-8f02-6f653142cbc5
[I 2026-01-28 21:43:34,122] Trial 0 finished with value: 8.80143145131081 and parameters: {'max_depth': 7, 'learning_rate': 0.008687653044877851, 'n_estimators': 1789, 'num_leaves': 39, 'min_child_samples': 40}. Best is trial 0 with value: 8.80143145131081.
[I 2026-01-28 21:46:07,759] Trial 1 finished with value: 8.901391494253422 and parameters: {'max_depth': 13, 'learning_rate': 0.07767628639840253, 'n_estimators': 1941, 'num_leaves': 170, 'min_child_samples': 46}. Best is trial 0 with value: 8.80143145131081.
[I 2026-01-28 21:47:14,352] Trial 2 finished with value: 8.888925907906838 and parameters: {'max_depth': 10, 'learning_rate': 0.005466424935149777, 'n_estimators': 655, 'num_leaves': 154, 'min_child_samples': 48}. Best is trial 0 with value: 8.80143145131081.
[I 2026-01-28 21:49:08,793] Trial 3 finished wi

In [98]:
print(f"LightGBM best trial: {study_lgbm.best_trial}")
print(f"LightGBM best value: {study_lgbm.best_value}")
print(f"LightGBM best params: {study_lgbm.best_params}")

LightGBM best trial: FrozenTrial(number=12, state=<TrialState.COMPLETE: 1>, values=[8.773328903937037], datetime_start=datetime.datetime(2026, 1, 28, 22, 3, 26, 928419), datetime_complete=datetime.datetime(2026, 1, 28, 22, 4, 31, 8868), params={'max_depth': 9, 'learning_rate': 0.02867322647820237, 'n_estimators': 1091, 'num_leaves': 88, 'min_child_samples': 67}, user_attrs={'rmse_std': np.float64(0.009374642548343253)}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=3, step=1), 'learning_rate': FloatDistribution(high=0.1, log=True, low=0.005, step=None), 'n_estimators': IntDistribution(high=2000, log=False, low=500, step=1), 'num_leaves': IntDistribution(high=255, log=False, low=30, step=1), 'min_child_samples': IntDistribution(high=100, log=False, low=5, step=1)}, trial_id=12, value=None)
LightGBM best value: 8.773328903937037
LightGBM best params: {'max_depth': 9, 'learning_rate': 0.02867322647820237, 'n_estimators': 1091,

### XGBoost

In [49]:
def objective_xgb(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.0, 1.0),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.0, 1.0),
        'random_state': 15,
        'objective': 'reg:squarederror',
        'eval_metric': 'rmse'
    }


    model= XGBRegressor(**params)

    scores = cross_val_score(model,
                             X,
                             y,
                             cv=3,
                             scoring="neg_root_mean_squared_error")
    
    rmse = - scores
    mean_rmse = rmse.mean()
    rmse_cv_std = rmse.std()

    trial.set_user_attr("rmse_std", rmse_cv_std)
    

    return mean_rmse

study_xgb = optuna.create_study(direction="minimize")
optuna.logging.set_verbosity(optuna.logging.INFO)
study_xgb.optimize(objective_xgb, n_trials=30)

[I 2026-01-24 16:39:40,546] A new study created in memory with name: no-name-52b265f2-024a-4120-8468-028ee3bf7278
[I 2026-01-24 16:41:46,119] Trial 0 finished with value: 8.791423461113464 and parameters: {'max_depth': 6, 'learning_rate': 0.04523131138436962, 'n_estimators': 1762, 'subsample': 0.9986271684293708, 'colsample_bytree': 0.6921713085458234, 'gamma': 1.037542458637894, 'min_child_weight': 1, 'reg_alpha': 0.6077049215816768, 'reg_lambda': 0.25981321061648277}. Best is trial 0 with value: 8.791423461113464.
[I 2026-01-24 16:42:37,342] Trial 1 finished with value: 8.814991034621805 and parameters: {'max_depth': 6, 'learning_rate': 0.03292480402516286, 'n_estimators': 597, 'subsample': 0.601350786132907, 'colsample_bytree': 0.6371453908094977, 'gamma': 4.849314855027347, 'min_child_weight': 3, 'reg_alpha': 0.10250641124558346, 'reg_lambda': 0.001727151433340568}. Best is trial 0 with value: 8.791423461113464.
[I 2026-01-24 16:46:37,382

In [50]:
print(f"XgBoost best trial: {study_xgb.best_trial}")
print(f"XgBoost best value: {study_xgb.best_value}")
print(f"XgBoost best params: {study_xgb.best_params}")

XgBoost best trial: FrozenTrial(number=20, state=<TrialState.COMPLETE: 1>, values=[8.788996416040739], datetime_start=datetime.datetime(2026, 1, 24, 17, 33, 44, 537867), datetime_complete=datetime.datetime(2026, 1, 24, 17, 37, 35, 369593), params={'max_depth': 7, 'learning_rate': 0.02424077717406291, 'n_estimators': 1701, 'subsample': 0.9431521463961212, 'colsample_bytree': 0.9242022349003731, 'gamma': 0.03224024966953887, 'min_child_weight': 10, 'reg_alpha': 0.38556156070740527, 'reg_lambda': 0.8143180804100962}, user_attrs={'rmse_std': np.float64(0.04597024410371482)}, system_attrs={}, intermediate_values={}, distributions={'max_depth': IntDistribution(high=15, log=False, low=3, step=1), 'learning_rate': FloatDistribution(high=0.3, log=True, low=0.005, step=None), 'n_estimators': IntDistribution(high=2000, log=False, low=500, step=1), 'subsample': FloatDistribution(high=1.0, log=False, low=0.5, step=None), 'colsample_bytree': FloatDistribution(high=1.0, log=False, low=0.5, step=None)

## Final Model Training & Prediction

### Test Data Pre-processing

In this section, the test set goes through the same preprocessing steps as the training data so that the tuned XGBoost and LightGBM models can generate predictions for it.

In [101]:
test = pd.read_csv("data/test.csv")
test_id = test["id"]
test.drop("id", axis=1, inplace=True)

test['NEW_academic_engagement'] = test['study_hours'] * test['class_attendance']
test['NEW_age_study_interaction'] = test['age'] * test['study_hours']
test['NEW_age_class_interaction'] = test['age'] * test['class_attendance']

test_cat_columns = [col for col in test.columns if test[col].dtype == "O"]

# Reuse the encoder fitted on the synthetic training data (transform, not fit_transform).
test_encode_cat_col = ohe_encode.transform(test[test_cat_columns])
test_encode_cat_col = pd.DataFrame(test_encode_cat_col, columns=ohe_encode.get_feature_names_out(test_cat_columns), index=test.index)

test.drop(test_cat_columns, axis=1, inplace=True)
test_data = pd.concat([test, test_encode_cat_col], axis=1)

# Scale with the min/max learned from the synthetic training data.
test_num_columns = ["NEW_academic_engagement", "NEW_age_study_interaction", "NEW_age_class_interaction"]
test_data[test_num_columns] = min_max_scaler.transform(test_data[test_num_columns])

# reindex returns a new frame, so assign it to align columns with the training features.
X_features = X.columns
test_data = test_data.reindex(columns=X_features, fill_value=0)
test_data.head()

Unnamed: 0,age,study_hours,class_attendance,sleep_hours,NEW_academic_engagement,NEW_age_study_interaction,NEW_age_class_interaction,gender_male,gender_other,course_b.sc,...,sleep_quality_good,sleep_quality_poor,study_method_group study,study_method_mixed,study_method_online videos,study_method_self-study,facility_rating_low,facility_rating_medium,exam_difficulty_hard,exam_difficulty_moderate
0,24,6.85,65.2,5.2,0.566243,0.865025,0.515866,0.0,1.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,18,6.61,45.0,9.3,0.375734,0.624045,0.070662,1.0,0.0,0.0,...,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,24,6.6,98.5,6.2,0.826114,0.833192,0.98726,0.0,0.0,0.0,...,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,24,3.03,66.3,5.7,0.252413,0.378608,0.531438,1.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0
4,20,2.03,42.4,9.2,0.105777,0.208192,0.093075,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
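
`DataFrame.reindex` returns a new frame instead of modifying the original, so its result has to be assigned for the column alignment to take effect. The sketch below demonstrates this on a hypothetical two-row frame; the column list `train_cols` is an assumption for illustration.

```python
import pandas as pd

train_cols = ["age", "gender_male", "gender_other"]  # hypothetical training layout
test_frame = pd.DataFrame({"gender_male": [1.0, 0.0], "age": [20, 23]})

# Calling reindex without assignment leaves the frame untouched.
unchanged = test_frame.copy()
unchanged.reindex(columns=train_cols, fill_value=0)

# Assigning the result reorders the columns and adds the missing one as zeros.
aligned = test_frame.reindex(columns=train_cols, fill_value=0)

print(list(unchanged.columns))
print(list(aligned.columns))
```

Aligning on the training column order also guards against a category that is present in training but absent from the test set, since the missing column is filled with zeros.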


### LightGBM

In [102]:
final_model_lgbm = LGBMRegressor(**study_lgbm.best_params,
    random_state=15).fit(X,y)

y_pred_lgbm = final_model_lgbm.predict(test_data)

### XGBoost

In [51]:
final_model_xgb = XGBRegressor(**study_xgb.best_params,
    random_state=15).fit(X,y)

y_pred_xgb = final_model_xgb.predict(test_data)

## Experiment Tracking with MLflow

MLflow is used to track experiments, including hyperparameters and evaluation metrics.

Due to Kaggle environment limitations, the MLflow UI is not displayed, but all runs are logged programmatically for reproducibility.

In [None]:
mlflow.set_experiment("Kaggle_exam_score_regression")

In [None]:
signature = infer_signature(X[:1], final_model_lgbm.predict(X[:1]))

with mlflow.start_run(run_name="Student exam Score pred - LightGBM"):
    mlflow.log_param("n_trials", len(study_lgbm.trials))
    mlflow.log_params(study_lgbm.best_params)
    mlflow.log_metric("rmse_cv_mean",study_lgbm.best_value)
    mlflow.log_metric(
        "rmse_cv_std",
        study_lgbm.best_trial.user_attrs["rmse_std"]
    )
    mlflow.set_tag("Model_type", final_model_lgbm.__class__.__name__)
    mlflow.set_tag("tuning", "optuna")
    mlflow.sklearn.log_model(sk_model=final_model_lgbm, name="LightGBM_Regressor", signature=signature)


2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.schemas
2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.tables
2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.types
2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.constraints
2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.defaults
2026/01/28 22:31:21 INFO alembic.runtime.plugins: setup plugin alembic.autogenerate.comments
2026/01/28 22:31:22 INFO mlflow.store.db.utils: Creating initial MLflow database tables...
2026/01/28 22:31:22 INFO mlflow.store.db.utils: Updating database tables
2026/01/28 22:31:22 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/28 22:31:22 INFO alembic.runtime.migration: Will assume non-transactional DDL.
2026/01/28 22:31:22 INFO alembic.runtime.migration: Context impl SQLiteImpl.
2026/01/28 22:31:22 INFO alembic.runtime

In [None]:
signature = infer_signature(X[:1], final_model_xgb.predict(X[:1]))

with mlflow.start_run(run_name="Student exam Score pred - XGBoost"):
    mlflow.log_param("n_trials", len(study_xgb.trials))
    mlflow.log_params(study_xgb.best_params)
    mlflow.log_metric("rmse_cv_mean",study_xgb.best_value)
    mlflow.log_metric(
        "rmse_cv_std",
        study_xgb.best_trial.user_attrs["rmse_std"]
    )
    mlflow.set_tag("Model_type", final_model_xgb.__class__.__name__)
    mlflow.set_tag("tuning", "optuna")
    mlflow.sklearn.log_model(sk_model=final_model_xgb, name="XGBoost_Regressor", signature=signature)

## Submission Files

Separate submission files are generated for LightGBM and XGBoost models.

This allows independent evaluation and comparison of model performance.

In [104]:
submission = pd.DataFrame({
    "id": test_id,
    "exam_score": y_pred_lgbm
})

submission.to_csv("submission_lightGBM.csv", index=False)

In [None]:
submission = pd.DataFrame({
    "id": test_id,
    "exam_score": y_pred_xgb
})

submission.to_csv("submission_XGBoost.csv", index=False)