# Diabetes Prediction Notebook
This notebook trains and compares XGBoost, Random Forest, and Gradient Boosting models to predict `diagnosed_diabetes`.

**Workflow:**
- Load data (`train.csv`, `test.csv`, `sample_submission.csv`)
- Explore missing values and basic shapes
- Encode categorical features with `LabelEncoder`
- Split data into train/validation sets
- Train XGBoost, Random Forest, and Gradient Boosting models
- Compare accuracies and create submission CSV

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score 
from sklearn.preprocessing import LabelEncoder 

In [2]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
submission_df = pd.read_csv('sample_submission.csv')


**Load datasets** — read `train.csv`, `test.csv`, and `sample_submission.csv` into DataFrames.

In [4]:
train_df.shape
test_df.shape

(300000, 25)

**Explore shapes**: inspect the number of rows and columns in the train and test sets. A cell displays only its last expression, so the output shown is the shape of `test_df`.
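One way to see both shapes from a single cell is to print them explicitly. The sketch below uses two small stand-in frames (`demo_train` and `demo_test` are hypothetical, not the notebook's data) and a helper named `describe_shapes` that is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical stand-in frames; in the notebook these would be train_df and test_df.
demo_train = pd.DataFrame(
    {"id": [0, 1, 2], "age": [31, 50, 32], "diagnosed_diabetes": [1, 1, 0]}
)
demo_test = pd.DataFrame({"id": [3, 4], "age": [54, 54]})


def describe_shapes(frames):
    """Return a {name: (rows, cols)} mapping for a dict of DataFrames."""
    return {name: df.shape for name, df in frames.items()}


shapes = describe_shapes({"train": demo_train, "test": demo_test})
for name, (rows, cols) in shapes.items():
    print(f"{name}: {rows} rows x {cols} columns")
```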

In [6]:
test_df.isnull().sum()

id                                    0
age                                   0
alcohol_consumption_per_week          0
physical_activity_minutes_per_week    0
diet_score                            0
sleep_hours_per_day                   0
screen_time_hours_per_day             0
bmi                                   0
waist_to_hip_ratio                    0
systolic_bp                           0
diastolic_bp                          0
heart_rate                            0
cholesterol_total                     0
hdl_cholesterol                       0
ldl_cholesterol                       0
triglycerides                         0
gender                                0
ethnicity                             0
education_level                       0
income_level                          0
smoking_status                        0
employment_status                     0
family_history_diabetes               0
hypertension_history                  0
cardiovascular_history                0


**Missing values**: count missing entries per column in the test set to guide preprocessing. Every column reports zero, so the test set needs no imputation.
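The same check can be run on several frames at once and reduced to the columns that actually need attention. This is a minimal sketch on hypothetical stand-in frames (`demo_train`, `demo_test`) with missing values planted on purpose; it is not output from the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical frames with a few missing values planted for illustration.
demo_train = pd.DataFrame({"age": [31, np.nan, 32], "bmi": [33.4, 23.8, 24.1]})
demo_test = pd.DataFrame({"age": [54, 54], "bmi": [np.nan, np.nan]})

# One row per column, one count per frame.
missing = pd.DataFrame(
    {
        "train_missing": demo_train.isnull().sum(),
        "test_missing": demo_test.isnull().sum(),
    }
)

# Keep only columns with at least one missing value in either frame.
missing = missing[(missing > 0).any(axis=1)]
print(missing)
```

An empty result means neither frame has gaps.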

In [7]:
encode = LabelEncoder()

**Prepare encoder** — create a `LabelEncoder` instance for encoding categorical features.

In [8]:
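cat_cols = train_df.select_dtypes(include=['object']).columns

for col in cat_cols:
    # Fit on the training column, then reuse that mapping for the test column so a
    # category gets the same code in both sets (transform raises on unseen categories).
    train_df[col] = encode.fit_transform(train_df[col])
    test_df[col] = encode.transform(test_df[col])


**Encode categorical features**: fit `LabelEncoder` on each object-type column of the training set and apply the same mapping to the matching column of the test set.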
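Integer codes are only comparable across train and test when both sets share one mapping, and a category that appears only in the test set has no code at all. As a hedged alternative to the notebook's `LabelEncoder`, scikit-learn's `OrdinalEncoder` can encode several columns at once and assign a sentinel to unseen categories. The category strings and `demo_*` frames below are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data; "Other" appears only in the test frame.
demo_train = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Female"],
        "smoking_status": ["Never", "Current", "Former"],
    }
)
demo_test = pd.DataFrame(
    {"gender": ["Male", "Other"], "smoking_status": ["Never", "Former"]}
)

cat_cols = ["gender", "smoking_status"]

# Unseen categories are mapped to -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# Fit on train only, then apply the identical mapping to test.
train_enc = pd.DataFrame(
    encoder.fit_transform(demo_train[cat_cols]),
    columns=cat_cols,
    index=demo_train.index,
)
test_enc = pd.DataFrame(
    encoder.transform(demo_test[cat_cols]),
    columns=cat_cols,
    index=demo_test.index,
)
print(test_enc)
```

Tree-based models can usually split around a sentinel such as -1, but whether that is appropriate depends on how often unseen categories occur.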

In [10]:
train_df.head(10)

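| | id | age | alcohol_consumption_per_week | physical_activity_minutes_per_week | diet_score | sleep_hours_per_day | screen_time_hours_per_day | bmi | waist_to_hip_ratio | systolic_bp | ... | gender | ethnicity | education_level | income_level | smoking_status | employment_status | family_history_diabetes | hypertension_history | cardiovascular_history | diagnosed_diabetes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 31 | 1 | 45 | 7.7 | 6.8 | 6.1 | 33.4 | 0.93 | 112 | ... | 0 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 50 | 2 | 73 | 5.7 | 6.5 | 5.8 | 23.8 | 0.83 | 120 | ... | 0 | 4 | 1 | 4 | 2 | 0 | 0 | 0 | 0 | 1 |
| 2 | 2 | 32 | 3 | 158 | 8.5 | 7.4 | 9.1 | 24.1 | 0.83 | 95 | ... | 1 | 2 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3 | 54 | 3 | 77 | 4.6 | 7.0 | 9.2 | 26.6 | 0.83 | 121 | ... | 0 | 4 | 1 | 2 | 0 | 0 | 0 | 1 | 0 | 1 |
| 4 | 4 | 54 | 1 | 55 | 5.7 | 6.2 | 5.1 | 28.8 | 0.9 | 108 | ... | 1 | 4 | 1 | 4 | 2 | 1 | 0 | 1 | 0 | 1 |
| 5 | 5 | 42 | 1 | 100 | 4.4 | 6.4 | 5.3 | 25.5 | 0.84 | 111 | ... | 0 | 4 | 1 | 2 | 2 | 1 | 0 | 0 | 0 | 0 |
| 6 | 6 | 41 | 2 | 148 | 3.4 | 5.6 | 3.7 | 27.9 | 0.89 | 130 | ... | 0 | 4 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 7 | 51 | 3 | 102 | 4.0 | 7.3 | 5.5 | 27.1 | 0.83 | 125 | ... | 1 | 0 | 1 | 1 | 2 | 0 | 1 | 0 | 0 | 1 |
| 8 | 8 | 34 | 2 | 44 | 2.7 | 7.0 | 7.9 | 22.6 | 0.81 | 120 | ... | 1 | 4 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 |
| 9 | 9 | 44 | 1 | 36 | 5.8 | 5.7 | 6.6 | 29.3 | 0.88 | 110 | ... | 1 | 2 | 1 | 3 | 2 | 0 | 1 | 0 | 0 | 1 |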


**Preview data** — display the first 10 rows of the training set to verify preprocessing.

In [11]:
train_df['education_level'].unique()

array([1, 0, 3, 2])
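The integer codes by themselves do not say which category each number stands for. A fitted `LabelEncoder` keeps the original labels in its `classes_` attribute, sorted so that position equals code. The sketch below recovers that mapping on a hypothetical list of education levels; the label strings are assumptions, not values taken from the dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels for illustration only.
levels = ["Highschool", "Graduate", "Postgraduate", "No formal", "Graduate"]

enc = LabelEncoder()
codes = enc.fit_transform(levels)

# classes_ is sorted; the index of each label is its integer code.
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels.
restored = enc.inverse_transform(codes).tolist()
```

Because the notebook reuses one encoder object across columns, `classes_` only reflects the most recently fitted column; keeping one encoder per column would preserve every mapping.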

In [14]:
x = train_df.drop(['id','diagnosed_diabetes'],axis=1)
y = train_df['diagnosed_diabetes']

**Feature/target split**: separate the features `x` (every column except `id` and `diagnosed_diabetes`) from the target `y` for model training.

In [15]:
x_train, x_test, y_train, y_test = train_test_split(x,y,random_state=2,test_size=0.2)

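**Train/validation split**: hold out 20% of the data for validation with a fixed random seed. The held-out part is named `x_test`/`y_test` in the code, but it is a validation set carved from `train_df`, not the competition test set in `test_df`.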
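A plain random split can leave the two parts with slightly different class proportions. As a hedged variation on the call above, passing `stratify` to `train_test_split` keeps the positive rate nearly identical in both parts. The synthetic `X_demo`/`y_demo` arrays and the 30% positive rate are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 4 features, roughly 30% positives.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.3).astype(int)

# stratify=y_demo preserves the class ratio in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

print(f"train positive rate:      {y_tr.mean():.3f}")
print(f"validation positive rate: {y_val.mean():.3f}")
```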

In [16]:
import xgboost as xgb
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier

# XGBoost classifier
model = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.9,
    colsample_bytree=0.9,
    min_child_weight=1,
    random_state=2
)
model.fit(x_train, y_train)

# Random Forest classifier (n_jobs=-1 uses all available cores)
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=15,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=2,
    n_jobs=-1
)
rf_model.fit(x_train, y_train)

# scikit-learn Gradient Boosting classifier
gb_model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.05,
    max_depth=5,
    min_samples_split=5,
    random_state=2
)
gb_model.fit(x_train, y_train)

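`GradientBoostingClassifier` parameters, as displayed:

| Parameter | Value |
| --- | --- |
| loss | 'log_loss' |
| learning_rate | 0.05 |
| n_estimators | 200 |
| subsample | 1.0 |
| criterion | 'friedman_mse' |
| min_samples_split | 5 |
| min_samples_leaf | 1 |
| min_weight_fraction_leaf | 0.0 |
| max_depth | 5 |
| min_impurity_decrease | 0.0 |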


**Train models** — train XGBoost, Random Forest, and Gradient Boosting with configured hyperparameters.
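A single train/validation split yields one score per model, so small gaps between models can come down to which rows landed in the held-out part. As a hedged sketch, k-fold cross-validation with scikit-learn's `cross_val_score` reports a mean and a spread instead. The synthetic data from `make_classification` and the reduced `n_estimators` values are assumptions chosen to keep the example fast, and XGBoost is left out only so the sketch depends on scikit-learn alone.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the notebook's x and y (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=2)

candidates = {
    "RandomForest": RandomForestClassifier(
        n_estimators=25, max_depth=5, random_state=2
    ),
    "GradientBoosting": GradientBoostingClassifier(
        n_estimators=25, max_depth=3, random_state=2
    ),
}

cv_results = {}
for name, estimator in candidates.items():
    # cross_val_score clones the estimator and fits it once per fold.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=3, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

On a dataset the size of the notebook's, each extra fold multiplies training time, so a small `cv` value is a reasonable starting point.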

In [17]:
# Compare all models on the held-out validation split (x_test, y_test)
xgb_pred = model.predict(x_test)
rf_pred = rf_model.predict(x_test)
gb_pred = gb_model.predict(x_test)

xgb_acc = accuracy_score(y_test, xgb_pred)
rf_acc = accuracy_score(y_test, rf_pred)
gb_acc = accuracy_score(y_test, gb_pred)

print("=== Model Comparison ===")
print(f"Improved XGBoost: {xgb_acc:.4f}")
print(f"Random Forest: {rf_acc:.4f}")
print(f"Gradient Boosting: {gb_acc:.4f}")

# Select best model
models_scores = [('XGBoost', model, xgb_acc), ('RandomForest', rf_model, rf_acc), ('GradientBoosting', gb_model, gb_acc)]
best_name, best_model_final, best_acc = max(models_scores, key=lambda x: x[2])
print(f"\nBest Model: {best_name} with accuracy {best_acc:.4f}")

# Make predictions with best model
X_test_submission = test_df.drop(['id'], axis=1)
test_predictions = best_model_final.predict(X_test_submission)

submission_df['diagnosed_diabetes'] = test_predictions
submission_df.to_csv('submission_improved.csv', index=False)

print(f"Submission file saved as 'submission_improved.csv' using {best_name}")

=== Model Comparison ===
Improved XGBoost: 0.6763
Random Forest: 0.6673
Gradient Boosting: 0.6771

Best Model: GradientBoosting with accuracy 0.6771
Submission file saved as 'submission_improved.csv' using GradientBoosting


**Evaluate & submit**: compute validation accuracy for each model, choose the best (Gradient Boosting at 0.6771 here), and save its predictions on `test_df` to `submission_improved.csv`.
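Accuracy is hard to interpret without knowing the class balance, which the notebook does not report. The sketch below, on synthetic data, shows two common companions to accuracy: a majority-class baseline from `DummyClassifier` and ROC AUC computed from predicted probabilities. The 70/30 class weights and all `*_demo` names are assumptions for illustration, not properties of the diabetes data.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, deliberately imbalanced stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.7, 0.3], random_state=2
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2, stratify=y_demo
)

# Baseline: always predict the most frequent training class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_val, baseline.predict(X_val))

# A small model to compare against the baseline.
clf = GradientBoostingClassifier(n_estimators=25, random_state=2).fit(X_tr, y_tr)
model_acc = accuracy_score(y_val, clf.predict(X_val))

# ROC AUC uses the predicted probability of the positive class, not hard labels.
model_auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])

print(f"baseline accuracy: {baseline_acc:.4f}")
print(f"model accuracy:    {model_acc:.4f}")
print(f"model ROC AUC:     {model_auc:.4f}")
```

If a model's accuracy sits close to the baseline, the model is adding little beyond predicting the majority class, whatever the headline number looks like.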