In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("iammustafatz/diabetes-prediction-dataset")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/iammustafatz/diabetes-prediction-dataset?dataset_version_number=1...


100%|██████████| 734k/734k [00:00<00:00, 70.7MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/iammustafatz/diabetes-prediction-dataset/versions/1





## Importing Necessary Libraries


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Importing the Dataset

In [6]:
import os
dataset_path = os.path.join(path, "diabetes_prediction_dataset.csv")
dataset = pd.read_csv(dataset_path)

## First Look at the Dataset

In [7]:
dataset.head()

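|   | gender | age  | hypertension | heart_disease | smoking_history | bmi   | HbA1c_level | blood_glucose_level | diabetes |
|---|--------|------|--------------|---------------|-----------------|-------|-------------|---------------------|----------|
| 0 | Female | 80.0 | 0            | 1             | never           | 25.19 | 6.6         | 140                 | 0        |
| 1 | Female | 54.0 | 0            | 0             | No Info         | 27.32 | 6.6         | 80                  | 0        |
| 2 | Male   | 28.0 | 0            | 0             | never           | 27.32 | 5.7         | 158                 | 0        |
| 3 | Female | 36.0 | 0            | 0             | current         | 23.45 | 5.0         | 155                 | 0        |
| 4 | Male   | 76.0 | 1            | 1             | current         | 20.14 | 4.8         | 155                 | 0        |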


In [9]:
dataset.info()
print("")
dataset.isnull().sum()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB



gender                 0
age                    0
hypertension           0
heart_disease          0
smoking_history        0
bmi                    0
HbA1c_level            0
blood_glucose_level    0
diabetes               0


## Processing Data

In [10]:
X = dataset.drop("diabetes", axis=1)
y = dataset["diabetes"]

## Split Data into Train and Test

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
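
When the positive class is rare, a purely random split can leave the train and test sets with slightly different positive rates. The sketch below shows the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both partitions. It is not part of the original notebook: the arrays `X_demo` and `y_demo` are synthetic stand-ins, and the 8.5% positive rate is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): 1,000 rows, 85 of them positive.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.zeros(1000, dtype=int)
y_demo[:85] = 1

# stratify=y_demo preserves the positive rate in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}")
print(f"test positive rate:  {test_rate:.3f}")
```

In the notebook itself, the equivalent change would be passing `stratify=y` to the existing call. The metrics reported further down were produced without it.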

## Column Transformer

In [12]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

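# Standardize the numeric features and one-hot encode the categorical ones.
# Note: remainder defaults to 'drop', so 'hypertension' and 'heart_disease'
# are not listed here and therefore never reach the models.
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['age', 'bmi', 'HbA1c_level', 'blood_glucose_level']),
    ('cat', OneHotEncoder(drop='first'), ['gender', 'smoking_history'])
])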
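
The transformer above lists six of the eight feature columns. Because `ColumnTransformer` drops unlisted columns by default (`remainder='drop'`), the binary flags `hypertension` and `heart_disease` are excluded. The sketch below compares that default with `remainder='passthrough'` on a tiny hypothetical frame. The frame `demo` and the helper `build` are illustrative names, not part of the notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hypothetical frame that reuses the dataset's column names.
demo = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male"],
    "age": [80.0, 54.0, 28.0, 36.0],
    "hypertension": [0, 0, 1, 0],
    "heart_disease": [1, 0, 0, 0],
    "smoking_history": ["never", "No Info", "never", "current"],
    "bmi": [25.19, 27.32, 27.32, 23.45],
    "HbA1c_level": [6.6, 6.6, 5.7, 5.0],
    "blood_glucose_level": [140, 80, 158, 155],
})

num_cols = ["age", "bmi", "HbA1c_level", "blood_glucose_level"]
cat_cols = ["gender", "smoking_history"]


def build(remainder):
    """Return the same transformer with a configurable remainder policy."""
    return ColumnTransformer(
        transformers=[
            ("num", StandardScaler(), num_cols),
            ("cat", OneHotEncoder(drop="first"), cat_cols),
        ],
        remainder=remainder,
    )


dropped = build("drop").fit_transform(demo)        # unlisted columns discarded
kept = build("passthrough").fit_transform(demo)    # unlisted columns appended as-is

print("remainder='drop':       ", dropped.shape)
print("remainder='passthrough':", kept.shape)
```

With passthrough, the two binary flags are appended unchanged after the transformed columns. Whether including them improves the models on the real data is not something this sketch demonstrates.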

## Models to Evaluate

In [13]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

models = {
    'Logistic Regression': LogisticRegression(),
    'Random Forest': RandomForestClassifier(),
    'Gradient Boosting': GradientBoostingClassifier(),
    'SVM': SVC()
}

## Evaluating Each Model

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

for name, model in models.items():
    # Chain preprocessing and the classifier so both are fit on the training data only
    pipeline = Pipeline(steps=[('preprocessor', preprocessor), ('model', model)])

    pipeline.fit(X_train, y_train)

    # Predict on the held-out test set
    y_pred = pipeline.predict(X_test)

    # Precision, recall, and F1 are computed for the positive class (diabetes = 1)
    print(f"--- {name} ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
    print("F1 Score:", f1_score(y_test, y_pred))
    print("\n")

--- Logistic Regression ---
Accuracy: 0.9586
Precision: 0.8716216216216216
Recall: 0.6042154566744731
F1 Score: 0.7136929460580913


--- Random Forest ---
Accuracy: 0.96945
Precision: 0.934283452098179
Recall: 0.6908665105386417
F1 Score: 0.7943453382699428


--- Gradient Boosting ---
Accuracy: 0.97225
Precision: 0.9957007738607051
Recall: 0.6779859484777517
F1 Score: 0.8066875653082549


--- SVM ---
Accuracy: 0.96465
Precision: 0.993103448275862
Recall: 0.5901639344262295
F1 Score: 0.7403598971722365
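
Per-model printouts like the ones above are hard to compare at a glance. One option, sketched here under assumptions, is to collect the scores into a single `pandas` DataFrame. The data comes from `make_classification` and only two models are fitted, so the numbers it prints are unrelated to the diabetes results. `demo_models`, `rows`, and `results` are illustrative names.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (hypothetical, roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

demo_models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

rows = []
for name, model in demo_models.items():
    y_pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, y_pred),
        "Precision": precision_score(y_te, y_pred, zero_division=0),
        "Recall": recall_score(y_te, y_pred, zero_division=0),
        "F1 Score": f1_score(y_te, y_pred, zero_division=0),
    })

# One row per model, one column per metric.
results = pd.DataFrame(rows).set_index("Model").round(3)
print(results)
```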





# Diabetes Prediction Analysis

This project explores a dataset on diabetes prediction, applies various machine learning models to classify diabetes cases, and evaluates their performance using key metrics. The goal is to identify the most effective model for predicting diabetes based on patient data.

## Table of Contents
1. [Dataset Information](#dataset-information)
2. [Exploratory Data Analysis (EDA)](#exploratory-data-analysis-eda)
3. [Machine Learning Models](#machine-learning-models)
4. [Model Evaluation](#model-evaluation)
5. [Conclusion](#conclusion)

## Dataset Information
The dataset contains the following features:

- `gender`: Gender of the patient
- `age`: Age of the patient
- `hypertension`: Presence of hypertension (1 = Yes, 0 = No)
- `heart_disease`: Presence of heart disease (1 = Yes, 0 = No)
- `smoking_history`: Smoking history category
- `bmi`: Body mass index
- `HbA1c_level`: Glycated hemoglobin (HbA1c) level, which reflects average blood glucose over the past 2-3 months
- `blood_glucose_level`: Current blood glucose level
- `diabetes`: Target variable (1 = Diabetes, 0 = No Diabetes)

The dataset includes 100,000 entries with no missing values, allowing for straightforward preprocessing.

## Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) was conducted to understand the distribution and relationships within the dataset. Key findings include:

- The target variable (`diabetes`) is imbalanced: the recall values reported by the notebook imply 1,708 positive cases among the 20,000 test rows (about 8.5%).
- Certain features, like `age`, `bmi`, and `blood_glucose_level`, appear to be related to the target.
- Categorical variables, such as `smoking_history` and `gender`, were one-hot encoded for model compatibility, and the numeric features were standardized.
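
A quick way to check the class balance of the target is `value_counts`, as in the minimal sketch below. The Series `target` is a hypothetical stand-in for `dataset["diabetes"]`, and its 915/85 split is an assumption used only to make the example self-contained.

```python
import pandas as pd

# Hypothetical stand-in for dataset["diabetes"]; swap in the real column.
target = pd.Series([0] * 915 + [1] * 85, name="diabetes")

counts = target.value_counts()                  # rows per class
shares = target.value_counts(normalize=True)    # fraction per class
summary = pd.DataFrame({"count": counts, "share": shares})
print(summary)

# Majority-to-minority ratio; values far above 1 indicate imbalance.
imbalance_ratio = counts.max() / counts.min()
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```

With a strongly imbalanced target, accuracy alone is a weak summary, which is why precision, recall, and F1 are worth reading alongside it.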

## Machine Learning Models

Four machine learning models were applied to the dataset:

1. **Logistic Regression**
2. **Random Forest Classifier**
3. **Gradient Boosting Classifier**
4. **Support Vector Machine (SVM)**

### Model Training

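Each model was trained with default hyperparameters on an 80-20 train-test split (`test_size=0.2`, `random_state=42`), and evaluation metrics were calculated on the test set. The preprocessing step passes only `age`, `bmi`, `HbA1c_level`, `blood_glucose_level`, `gender`, and `smoking_history` to the models; `hypertension` and `heart_disease` are dropped because the `ColumnTransformer` does not list them.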

## Model Evaluation

The models were evaluated based on the following metrics:

- **Accuracy**: The share of all predictions that are correct.
- **Precision**: The share of predicted positive cases that are actually positive.
- **Recall**: The share of actual positive cases that the model identifies.
- **F1 Score**: The harmonic mean of precision and recall, balancing the two.
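
The four metrics follow directly from the confusion-matrix counts, as the worked example below shows. The counts are not printed by the notebook; they are reconstructed here so that they reproduce the Logistic Regression scores it reports, assuming a 20,000-row test set (20% of 100,000 rows).

```python
# Confusion-matrix counts (reconstructed, see the assumption above).
tp = 1032    # diabetic patients predicted as diabetic
fp = 152     # non-diabetic patients predicted as diabetic
fn = 676     # diabetic patients predicted as non-diabetic
tn = 18140   # non-diabetic patients predicted as non-diabetic

total = tp + fp + fn + tn

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1:.4f}")
```

The large number of true negatives keeps accuracy high even though 676 of the 1,708 positive cases are missed, which is why recall and F1 are the more informative figures here.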

### Results Summary

| Model               | Accuracy | Precision | Recall | F1 Score |
|---------------------|----------|-----------|--------|----------|
| Logistic Regression | 0.959    | 0.872     | 0.604  | 0.714    |
| Random Forest       | 0.969    | 0.934     | 0.691  | 0.794    |
| Gradient Boosting   | 0.972    | 0.996     | 0.678  | 0.807    |
| SVM                 | 0.965    | 0.993     | 0.590  | 0.740    |

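- **Gradient Boosting** achieved the highest accuracy, precision, and F1 score, making it the top-performing model overall in this analysis.
- **Random Forest** achieved the highest recall (0.691), slightly ahead of Gradient Boosting (0.678).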

## Conclusion

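The **Gradient Boosting** model performed best overall for diabetes prediction, with the highest accuracy and F1 score. Its precision is very high (0.996), but its recall is only 0.678, so roughly one in three diabetic patients in the test set is missed. The model could be further enhanced with hyperparameter tuning or additional feature engineering.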

For use in a real-world setting, this model provides a starting point for diabetes prediction, but it should be evaluated further with cross-validation and, ideally, on larger, more diverse datasets.
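
As a minimal sketch of the cross-validation step suggested above, the example below scores a Gradient Boosting classifier with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` rather than the diabetes dataset, and the fold count, `n_estimators=50`, and the F1 scoring choice are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (hypothetical, roughly 10% positives).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the positive rate similar across all five splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(n_estimators=50, random_state=42)

# One F1 score per fold, computed for the positive class.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1")

print("F1 per fold:", scores.round(3))
print(f"Mean F1: {scores.mean():.3f} (std {scores.std():.3f})")
```

On the real data, the same call could take the notebook's full `Pipeline` in place of `model`, so that scaling and encoding are refit inside each fold.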


