# Diabetes Prediction – Kaggle Playground Series S5E12

## Project Overview

This notebook is part of the Kaggle competition **Playground Series - Season 5, Episode 12**.
The goal is to build a machine learning model capable of predicting whether a patient has diabetes based on medical features.

This project demonstrates the full data science workflow:
- Data exploration
- Feature preprocessing
- Model training
- Performance evaluation

This notebook is intended both for competition purposes and as a **portfolio project**.

Kaggle: https://www.kaggle.com/competitions/playground-series-s5e12/data?select=test.csv

## Libraries Import and Data Loading

We start by importing the necessary Python libraries for data manipulation, visualization, and machine learning.

The dataset is provided by Kaggle and consists of:
- A training set with features and target variable
- A test set without labels, used for submission


In [7]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

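try:
    df_train = pd.read_csv('data/train.csv')
    df_test = pd.read_csv('data/test.csv')
    print("Data loaded successfully!")
except FileNotFoundError:
    # Stop here: every later cell needs df_train and df_test.
    print("Files not found: expected data/train.csv and data/test.csv")
    raise

Data loaded successfully!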


## Exploratory Data Analysis (EDA)

In this section, we explore the dataset to better understand:
- Data structure
- Feature distributions
- Missing values
- Potential outliers
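
The missing-value and outlier checks listed above can be sketched as follows, on a small synthetic frame rather than the competition data. The helper name `iqr_outlier_counts` and the 1.5 × IQR rule are illustrative assumptions, not something this notebook uses elsewhere.

```python
import numpy as np
import pandas as pd


def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Toy data (hypothetical values): one missing BMI and one extreme activity value.
toy = pd.DataFrame({
    "bmi": [22.0, 25.0, np.nan, 27.0, 24.0, 26.0],
    "physical_activity_minutes_per_week": [60, 70, 80, 75, 65, 747],
})

missing_per_column = toy.isna().sum()
outliers_per_column = iqr_outlier_counts(toy)

print(missing_per_column)
print(outliers_per_column)
```

A flagged value is only a candidate for review: a weekly activity of 747 minutes is unusual but still physically possible.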


### Dataset Overview

In [8]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 700000 entries, 0 to 699999
Data columns (total 26 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   id                                  700000 non-null  int64  
 1   age                                 700000 non-null  int64  
 2   alcohol_consumption_per_week        700000 non-null  int64  
 3   physical_activity_minutes_per_week  700000 non-null  int64  
 4   diet_score                          700000 non-null  float64
 5   sleep_hours_per_day                 700000 non-null  float64
 6   screen_time_hours_per_day           700000 non-null  float64
 7   bmi                                 700000 non-null  float64
 8   waist_to_hip_ratio                  700000 non-null  float64
 9   systolic_bp                         700000 non-null  int64  
 10  diastolic_bp                        700000 non-null  int64  
 11  heart_rate                

In [9]:
df_train.describe()

| | id | age | alcohol_consumption_per_week | physical_activity_minutes_per_week | diet_score | sleep_hours_per_day | screen_time_hours_per_day | bmi | waist_to_hip_ratio | systolic_bp | diastolic_bp | heart_rate | cholesterol_total | hdl_cholesterol | ldl_cholesterol | triglycerides | family_history_diabetes | hypertension_history | cardiovascular_history | diagnosed_diabetes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 | 700000.0 |
| mean | 349999.5 | 50.359734 | 2.072411 | 80.230803 | 5.963695 | 7.0022 | 6.012733 | 25.874684 | 0.858766 | 116.294193 | 75.440924 | 70.167749 | 186.818801 | 53.823214 | 102.905854 | 123.08185 | 0.149401 | 0.18199 | 0.030324 | 0.623296 |
| std | 202072.738554 | 11.65552 | 1.048189 | 51.195071 | 1.463336 | 0.901907 | 2.022707 | 2.860705 | 0.03798 | 11.01039 | 6.825775 | 6.938722 | 16.730832 | 8.266545 | 19.022416 | 24.739397 | 0.356484 | 0.385837 | 0.171478 | 0.48456 |
| min | 0.0 | 19.0 | 1.0 | 1.0 | 0.1 | 3.1 | 0.6 | 15.1 | 0.68 | 91.0 | 51.0 | 42.0 | 117.0 | 21.0 | 51.0 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 174999.75 | 42.0 | 1.0 | 49.0 | 5.0 | 6.4 | 4.6 | 23.9 | 0.83 | 108.0 | 71.0 | 65.0 | 175.0 | 48.0 | 89.0 | 106.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 349999.5 | 50.0 | 2.0 | 71.0 | 6.0 | 7.0 | 6.0 | 25.9 | 0.86 | 116.0 | 75.0 | 70.0 | 187.0 | 54.0 | 103.0 | 123.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 75% | 524999.25 | 58.0 | 3.0 | 96.0 | 7.0 | 7.6 | 7.4 | 27.8 | 0.88 | 124.0 | 80.0 | 75.0 | 199.0 | 59.0 | 116.0 | 139.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| max | 699999.0 | 89.0 | 9.0 | 747.0 | 9.9 | 9.9 | 16.5 | 38.4 | 1.05 | 163.0 | 104.0 | 101.0 | 289.0 | 90.0 | 205.0 | 290.0 | 1.0 | 1.0 | 1.0 | 1.0 |


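### First observations
The `describe()` output shows 700,000 values in every numerical column, so none of them has missing entries to impute. The value ranges are plausible: BMI runs from 15.1 to 38.4 (the maximum falls in the obese range), systolic blood pressure from 91 to 163, diastolic from 51 to 104, and weekly physical activity tops out at 747 minutes. The target `diagnosed_diabetes` is binary (0 or 1), and its mean of 0.623 means that about 62% of the training rows are positive.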
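
For a 0/1 target, the column mean equals the share of positive rows, which is how the `describe()` mean of 0.623 for `diagnosed_diabetes` can be read. A minimal sketch of checking both properties directly, using a hypothetical stand-in series instead of `df_train['diagnosed_diabetes']`:

```python
import pandas as pd

# Hypothetical stand-in for df_train['diagnosed_diabetes'].
target = pd.Series([1, 0, 1, 1, 0, 1, 1, 0], name="diagnosed_diabetes")

# The target is binary if it contains no value other than 0 and 1.
is_binary = set(target.unique()) <= {0, 1}

# Share of each class, indexed by class label.
class_share = target.value_counts(normalize=True).sort_index()

print(f"binary target: {is_binary}")
print(class_share)
```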

In [10]:
categorical_cols = [
    'gender', 
    'ethnicity', 
    'education_level', 
    'income_level', 
    'smoking_status', 
    'employment_status'
]

for col in categorical_cols:
    print(f"\n--- {col} ---")
    print(df_train[col].value_counts())


--- gender ---
gender
Female    363237
Male      333085
Other       3678
Name: count, dtype: int64

--- ethnicity ---
ethnicity
White       386153
Hispanic    129984
Black       106301
Asian        60120
Other        17442
Name: count, dtype: int64

--- education_level ---
education_level
Highschool      344145
Graduate        261268
Postgraduate     79642
No formal        14945
Name: count, dtype: int64

--- income_level ---
income_level
Middle          290557
Lower-Middle    178570
Upper-Middle    127836
Low              85803
High             17234
Name: count, dtype: int64

--- smoking_status ---
smoking_status
Never      494448
Current    103363
Former     102189
Name: count, dtype: int64

--- employment_status ---
employment_status
Employed      516170
Retired       115735
Unemployed     49787
Student        18308
Name: count, dtype: int64


## Data Preprocessing

Before training the models, several preprocessing steps are applied:
- Handling missing values
- Feature scaling
- Encoding if necessary

These steps ensure that the data is suitable for machine learning algorithms.


Ordinal (label) encoding: `education_level`, `income_level`, `smoking_status`


One-hot encoding: `gender`, `ethnicity`, `employment_status`
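
One thing worth guarding against with dictionary-based encoding: `Series.map` returns `NaN` for any label that is missing from the mapping, without raising an error. The sketch below shows that behaviour on a toy column; the label `'Very High'` is a made-up value for illustration and does not appear in the dataset.

```python
import pandas as pd

income_mapping = {"Low": 1, "Lower-Middle": 2, "Middle": 3, "Upper-Middle": 4, "High": 5}

# Toy column (hypothetical) containing one label that the mapping does not cover.
income = pd.Series(["Low", "Middle", "High", "Very High"], name="income_level")

encoded = income.map(income_mapping)

# Labels that turned into NaN because the mapping has no entry for them.
unmapped = sorted(set(income[encoded.isna()]))

print(encoded.tolist())
print(f"unmapped labels: {unmapped}")
```

Running the same check on the real columns after each `.map` call would confirm that every category was covered.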

In [11]:
income_mapping = {
    'Low': 1,
    'Lower-Middle': 2,
    'Middle': 3,
    'Upper-Middle': 4,
    'High': 5
}

df_train['income_level_encoded'] = df_train['income_level'].map(income_mapping)

print("The first 5 lines of the dataset after income_level encoding:")
print(df_train[['income_level', 'income_level_encoded']].head())

The first 5 lines of the dataset after income_level encoding:
   income_level  income_level_encoded
0  Lower-Middle                     2
1  Upper-Middle                     4
2  Lower-Middle                     2
3  Lower-Middle                     2
4  Upper-Middle                     4


In [12]:
education_mapping = {
    'No formal': 0,
    'Highschool': 1,
    'Graduate': 2,
    'Postgraduate': 3
}

df_train['education_level_encoded'] = df_train['education_level'].map(education_mapping)

print("The first 5 lines of the dataset after education_level encoding:")
print(df_train[['education_level', 'education_level_encoded']].head())

The first 5 lines of the dataset after education_level encoding:
  education_level  education_level_encoded
0      Highschool                        1
1      Highschool                        1
2      Highschool                        1
3      Highschool                        1
4      Highschool                        1


In [13]:
smoking_mapping = {
    'Never': 0,
    'Current': 1,
    'Former': 2
}

df_train['smoking_status_encoded'] = df_train['smoking_status'].map(smoking_mapping)

print("The first 5 lines of the dataset after smoking_status encoding:")
print(df_train[['smoking_status', 'smoking_status_encoded']].head())

The first 5 lines of the dataset after smoking_status encoding:
  smoking_status  smoking_status_encoded
0        Current                       1
1          Never                       0
2          Never                       0
3        Current                       1
4          Never                       0


In [14]:
nominal_cols = ['gender', 'ethnicity', 'employment_status']

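# One-hot encode the nominal columns; drop_first=True drops one level per column
# to avoid redundant dummies. dtype=int matches the encoding used for the test set.
df_train = pd.get_dummies(df_train, columns=nominal_cols, drop_first=True, dtype=int)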


### Feature Engineering and Scaling

In [15]:
df_train['cholesterol_ratio'] = df_train['ldl_cholesterol'] / df_train['hdl_cholesterol']

df_train['mean_blood_pressure'] = (df_train['systolic_bp'] + df_train['diastolic_bp']) / 2

In [16]:
numerical_cols = [
    'age', 'alcohol_consumption_per_week', 'physical_activity_minutes_per_week',
    'diet_score', 'sleep_hours_per_day', 'screen_time_hours_per_day', 'bmi',
    'waist_to_hip_ratio', 'systolic_bp', 'diastolic_bp', 'heart_rate',
    'cholesterol_total', 'hdl_cholesterol', 'ldl_cholesterol', 'triglycerides','cholesterol_ratio','mean_blood_pressure'
]

scaler = StandardScaler()
df_train[numerical_cols] = scaler.fit_transform(df_train[numerical_cols])
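
Note that the scaler above is fitted on all of `df_train` before the train/validation split, so validation rows contribute to the scaling statistics. One possible alternative, sketched here on synthetic data from `make_classification` (an assumption, not the competition data), is to wrap the scaler and the model in a scikit-learn `Pipeline` so that the scaler only ever sees training rows. Tree-based models are largely insensitive to feature scaling, so the practical effect in this notebook is likely small.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the competition data (assumption, for illustration only).
X, y = make_classification(n_samples=2000, n_features=10, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", GradientBoostingClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)  # scaler statistics come from X_tr only

# At prediction time the pipeline applies the already-fitted scaler, then the model.
val_auc = roc_auc_score(y_va, pipe.predict_proba(X_va)[:, 1])
print(f"validation AUC: {val_auc:.3f}")
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refitted inside each cross-validation fold.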


In [17]:
cols_to_drop = ['id', 'income_level', 'education_level', 'smoking_status']
df_train = df_train.drop(columns=cols_to_drop)

In [18]:
final_features = df_train.drop(columns=['diagnosed_diabetes']).columns.tolist()
len(final_features)

32

## Train-Validation Split

The dataset is split into training and validation sets in order to evaluate model performance on unseen data.
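
The split in this notebook uses `stratify=y`, which keeps the class proportions the same in both parts. A minimal sketch of that effect, using hypothetical labels with 62% positives (roughly the share seen in the training data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 620 positives and 380 negatives.
y = np.array([1] * 620 + [0] * 380)
X = np.arange(len(y)).reshape(-1, 1)  # dummy single-feature matrix

X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"train positive share: {y_tr.mean():.3f}")
print(f"val positive share:   {y_va.mean():.3f}")
```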


In [19]:
X = df_train.drop(columns=['diagnosed_diabetes'])
y = df_train['diagnosed_diabetes']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)


print(f"X_train shape (training features): {X_train.shape}")
print(f"X_val shape (validation features): {X_val.shape}")
print(f"y_train shape (training target): {y_train.shape}")
print(f"y_val shape (validation target): {y_val.shape}")

X_train shape (training features): (560000, 32)
X_val shape (validation features): (140000, 32)
y_train shape (training target): (560000,)
y_val shape (validation target): (140000,)


## Model Training

A Gradient Boosting classifier is trained in two stages: first a baseline with default hyperparameters, then a version tuned with grid search.
The objective is to find the configuration that gives the best ROC AUC score.
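
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on toy values and compares the result with `roc_auc_score`; the helper `pairwise_auc` is a hypothetical name introduced for illustration.

```python
from itertools import product

from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (hypothetical values).
y_true = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.70, 0.90]

manual = pairwise_auc(y_true, scores)
reference = roc_auc_score(y_true, scores)
print(manual, reference)
```

Because only the ranking of the scores matters, the metric is computed from `predict_proba` outputs rather than from hard 0/1 predictions.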


In [20]:
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

gb_classifier = GradientBoostingClassifier(random_state=42)

gb_classifier.fit(X_train, y_train)

y_pred_proba = gb_classifier.predict_proba(X_val)[:, 1]

auc_roc = roc_auc_score(y_val, y_pred_proba)

print(f"AUC-ROC score of Gradient Boosting across the validation set: {auc_roc:.5f}")

AUC-ROC score of Gradient Boosting across the validation set: 0.70709


In [21]:
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_auc_score

param_grid = {
    'n_estimators': [100, 150, 200, 250, 300],  
    'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25], 
    'max_depth': [3], 
    'subsample': [0.8], 
}


gb_model = GradientBoostingClassifier(random_state=42)

grid_search = GridSearchCV(
    estimator=gb_model, 
    param_grid=param_grid, 
    scoring='roc_auc', 
    cv=3, 
    verbose=2, 
    n_jobs=-1
)

print("Starting Grid Search...")
grid_search.fit(X_train, y_train)

print("\nGrid Search results:")
print(f"Best cross-validation score: {grid_search.best_score_:.5f}")
print(f"Best hyperparameters: {grid_search.best_params_}")

# best_estimator_ is refitted on the full training set with the best parameters.
best_gb_model = grid_search.best_estimator_
y_pred_proba_tuned = best_gb_model.predict_proba(X_val)[:, 1]
final_auc_roc = roc_auc_score(y_val, y_pred_proba_tuned)

print(f"Final AUC-ROC score on the validation set (with optimized hyperparameters): {final_auc_roc:.5f}")

Starting Grid Search...
Fitting 3 folds for each of 25 candidates, totalling 75 fits
[CV] END learning_rate=0.05, max_depth=3, n_estimators=100, subsample=0.8; total time= 2.5min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=100, subsample=0.8; total time= 2.5min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=100, subsample=0.8; total time= 2.5min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=150, subsample=0.8; total time= 3.8min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=150, subsample=0.8; total time= 3.8min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=150, subsample=0.8; total time= 3.9min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=200, subsample=0.8; total time= 5.1min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=200, subsample=0.8; total time= 5.2min
[CV] END learning_rate=0.1, max_depth=3, n_estimators=100, subsample=0.8; total time= 2.7min
[CV] END learning_rate=0.05, max_depth=3, n_estimators=200, subsamp

## Model Evaluation

Models are evaluated using ROC AUC on the validation set.
This allows us to compare performance and detect potential overfitting.

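The validation score (0.72527) is very close to the best cross-validation score (0.72436), which suggests that the model generalizes consistently and is not significantly overfitting. Tuning also improves on the default-parameter baseline, which scored 0.70709 on the same validation set.

The best hyperparameters found are:
- `learning_rate`: 0.25
- `n_estimators`: 300

Both values are the largest in their grid, so within the range searched the model benefits from a higher learning rate and more trees. The optimum may lie beyond the grid, and a wider search would be needed to confirm it.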

## Final Model & Kaggle Submission

The best-performing model is selected to generate predictions on the test dataset.
These predictions are then formatted according to Kaggle submission requirements.
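
Before uploading, it can help to sanity-check the submission frame. The sketch below assumes the two-column layout used at the end of this notebook (`id`, `diagnosed_diabetes`); the helper `check_submission` and the toy ids are hypothetical, and the exact required format should be confirmed against the competition's sample submission file.

```python
import numpy as np
import pandas as pd


def check_submission(sub: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise AssertionError if the frame does not look like a valid submission."""
    assert list(sub.columns) == ["id", "diagnosed_diabetes"], "unexpected columns"
    assert len(sub) == len(expected_ids), "row count differs from the test set"
    assert sub["id"].is_unique, "duplicate ids"
    assert sub["id"].tolist() == expected_ids.tolist(), "ids missing or out of order"
    # between() is False for NaN, so missing predictions are caught here too.
    assert sub["diagnosed_diabetes"].between(0, 1).all(), "values outside [0, 1]"


# Hypothetical ids and probabilities standing in for test_ids and the model output.
test_ids = pd.Series([700000, 700001, 700002], name="id")
predictions = np.array([0.12, 0.87, 0.55])

submission = pd.DataFrame({"id": test_ids, "diagnosed_diabetes": predictions})
check_submission(submission, test_ids)
print("submission looks valid")
```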



In [22]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 300000 entries, 0 to 299999
Data columns (total 25 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   id                                  300000 non-null  int64  
 1   age                                 300000 non-null  int64  
 2   alcohol_consumption_per_week        300000 non-null  int64  
 3   physical_activity_minutes_per_week  300000 non-null  int64  
 4   diet_score                          300000 non-null  float64
 5   sleep_hours_per_day                 300000 non-null  float64
 6   screen_time_hours_per_day           300000 non-null  float64
 7   bmi                                 300000 non-null  float64
 8   waist_to_hip_ratio                  300000 non-null  float64
 9   systolic_bp                         300000 non-null  int64  
 10  diastolic_bp                        300000 non-null  int64  
 11  heart_rate                

In [30]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier


income_mapping = {'Low': 1, 'Lower-Middle': 2, 'Middle': 3, 'Upper-Middle': 4, 'High': 5}
education_mapping = {'No formal': 0, 'Highschool': 1, 'Graduate': 2, 'Postgraduate': 3}
smoking_mapping = {'Never': 0, 'Current': 1, 'Former': 2}
nominal_cols = ['gender', 'ethnicity', 'employment_status']
numerical_cols = [
    'age', 'alcohol_consumption_per_week', 'physical_activity_minutes_per_week',
    'diet_score', 'sleep_hours_per_day', 'screen_time_hours_per_day', 'bmi',
    'waist_to_hip_ratio', 'systolic_bp', 'diastolic_bp', 'heart_rate',
    'cholesterol_total', 'hdl_cholesterol', 'ldl_cholesterol', 'triglycerides','cholesterol_ratio','mean_blood_pressure'
]

df_test = pd.read_csv('data/test.csv')

# Keep the ids for the submission file before the column is dropped.
test_ids = df_test['id']

# Same engineered features as in training.
df_test['cholesterol_ratio'] = df_test['ldl_cholesterol'] / df_test['hdl_cholesterol']
df_test['mean_blood_pressure'] = (df_test['systolic_bp'] + df_test['diastolic_bp']) / 2

# Same ordinal mappings as in training.
df_test['income_level_encoded'] = df_test['income_level'].map(income_mapping)
df_test['education_level_encoded'] = df_test['education_level'].map(education_mapping)
df_test['smoking_status_encoded'] = df_test['smoking_status'].map(smoking_mapping)

df_test = pd.get_dummies(df_test, columns=nominal_cols, drop_first=True, dtype=int)

# Reuse the scaler fitted on the training data; do not refit it on the test set.
df_test[numerical_cols] = scaler.transform(df_test[numerical_cols])

cols_to_drop = ['id', 'income_level', 'education_level', 'smoking_status']
df_test = df_test.drop(columns=cols_to_drop, errors='ignore')

# Align with the training matrix: same column order, missing dummies filled with 0.
X_test_final = df_test.reindex(columns=X_train.columns, fill_value=0)

# Predicted probability of the positive class.
predictions = best_gb_model.predict_proba(X_test_final)[:, 1]


In [31]:
submission = pd.DataFrame({'id': test_ids, 'diagnosed_diabetes': predictions})
submission.to_csv('submission.csv', index=False)