<div style="background-color: #30302E; padding: 20px; border-radius: 10px; border-left: 5px solid #007bff;">
    <h1 style="text-align: center;">🔬 Diabetes Prediction Project</h1>
    <h2 style="text-align: center;">Comprehensive Exploratory Data Analysis</h2>
</div>

<div style="background-color: #30302E; padding: 15px; border-radius: 5px; margin-top: 20px;">
    <p><strong>Project Overview:</strong> This notebook analyzes a dataset for binary diabetes classification: No Diabetes (0) and Diabetes (1). We aim to uncover key insights and patterns to support the development of a robust classification model.</p>
</div>

<div style="background-color: #30302E; padding: 15px; border-radius: 5px;">
    <h3 style="color: #856404;">📋 Analysis Objectives</h3>
    <ul>
        <li>Understand feature distributions and identify any skewness</li>
        <li>Analyze class imbalance in the target variable</li>
        <li>Analyze feature relationships and correlations</li>
        <li>Identify key predictors for diabetes</li>
        <li>Guide feature engineering and preprocessing decisions</li>
    </ul>
</div>

## 📦 Import Libraries and Load Data

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
import sys

# Ignore warnings for cleaner notebook output
warnings.filterwarnings('ignore')

# Set plot style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('viridis')

# Add the project root to the path to import project modules
sys.path.append(str(Path.cwd().parent))

# Import our custom analysis modules
from analysis.analyze_distributions import analyze_distributions
from analysis.analyze_class_imbalance import analyze_class_imbalance
from analysis.analyze_correlations import analyze_correlations
from analysis.analyze_feature_importance import analyze_feature_importance
from src.data.data_versioning import DataVersioner

In [4]:
# Load the diabetes dataset
data_path = '../data/extracted/diabetes_prediction_dataset/diabetes_prediction_dataset.csv'
data = pd.read_csv(data_path)

# Display basic information
print("Dataset Overview:")
print(f"Number of samples: {len(data)}")
print(f"Number of features: {len(data.columns) - 1}\n")  # exclude the target column

display(data.head())

Dataset Overview:
Number of samples: 100000
Number of features: 8



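| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |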


In [5]:
# Display data info and summary statistics
print("Data Info:")
data.info()  # info() prints its report directly and returns None, so it is not wrapped in display()

print("\nNumerical Features Summary:")
display(data.describe())

Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 9 columns):
 #   Column               Non-Null Count   Dtype  
---  ------               --------------   -----  
 0   gender               100000 non-null  object 
 1   age                  100000 non-null  float64
 2   hypertension         100000 non-null  int64  
 3   heart_disease        100000 non-null  int64  
 4   smoking_history      100000 non-null  object 
 5   bmi                  100000 non-null  float64
 6   HbA1c_level          100000 non-null  float64
 7   blood_glucose_level  100000 non-null  int64  
 8   diabetes             100000 non-null  int64  
dtypes: float64(3), int64(4), object(2)
memory usage: 6.9+ MB


Numerical Features Summary:


| | age | hypertension | heart_disease | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 41.885856 | 0.07485 | 0.03942 | 27.320767 | 5.527507 | 138.05806 | 0.085 |
| std | 22.51684 | 0.26315 | 0.194593 | 6.636783 | 1.070672 | 40.708136 | 0.278883 |
| min | 0.08 | 0.0 | 0.0 | 10.01 | 3.5 | 80.0 | 0.0 |
| 25% | 24.0 | 0.0 | 0.0 | 23.63 | 4.8 | 100.0 | 0.0 |
| 50% | 43.0 | 0.0 | 0.0 | 27.32 | 5.8 | 140.0 | 0.0 |
| 75% | 60.0 | 0.0 | 0.0 | 29.58 | 6.2 | 159.0 | 0.0 |
| max | 80.0 | 1.0 | 1.0 | 95.69 | 9.0 | 300.0 | 1.0 |


## 🔍 Data Quality Check

In [6]:
# Check for missing values
missing_values = data.isnull().sum()
print("Missing Values:")
display(missing_values[missing_values > 0] if missing_values.sum() > 0 else "No missing values found")

# Check for duplicates
duplicates = data.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicates}")

# Display unique values in categorical columns
categorical_columns = ['gender', 'smoking_history']
print("\nUnique values in categorical columns:")
for col in categorical_columns:
    print(f"\n{col}:")
    display(data[col].value_counts())

Missing Values:


'No missing values found'


Number of duplicate rows: 3854

Unique values in categorical columns:

gender:


gender
Female    58552
Male      41430
Other        18
Name: count, dtype: int64


smoking_history:


smoking_history
No Info        35816
never          35095
former          9352
current         9286
not current     6447
ever            4004
Name: count, dtype: int64
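
The checks above report 3,854 duplicate rows and a `gender` category (`Other`) with only 18 records. The sketch below shows one possible way to handle both before modeling. It assumes that exact duplicates are safe to drop (identical rows could in principle be distinct patients) and that very rare categories are removed rather than kept. `clean_for_modeling` and `min_count` are hypothetical names for illustration, not part of the project's modules.

```python
import pandas as pd


def clean_for_modeling(df: pd.DataFrame, column: str, min_count: int) -> pd.DataFrame:
    """Drop exact duplicate rows, then drop rows whose `column` value is rare.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    deduped = df.drop_duplicates().reset_index(drop=True)
    counts = deduped[column].value_counts()
    frequent = counts[counts >= min_count].index
    return deduped[deduped[column].isin(frequent)].reset_index(drop=True)


# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male", "Other", "Female"],
    "age": [80.0, 80.0, 28.0, 36.0, 50.0, 54.0],
    "diabetes": [0, 0, 0, 0, 0, 1],
})
cleaned = clean_for_modeling(toy, column="gender", min_count=2)
print(cleaned)
```

Whether to drop duplicates at all is a judgment call: with only eight features, some repeated rows may be genuine coincidences rather than data-entry errors.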

## 📊 Analyze Distributions

In [7]:
analyze_distributions(data)

[2025-02-24 01:17:18] |     INFO | [analyze_distributions.py:  88] | analyze_distribution | The categorical feature distribution analysis is completed and saved in reports\distributions_figures
[2025-02-24 01:17:40] |     INFO | [analyze_distributions.py: 121] | analyze_distribution | The continuous feature distribution analysis is completed and saved in reports\distributions_figures
[2025-02-24 01:17:42] |     INFO | [analyze_distributions.py: 152] | analyze_distribution | The binary feature distribution analysis is completed and saved in reports\distributions_figures


## 📊 Analyze Class Imbalance

In [8]:
analyze_class_imbalance(data)

[2025-02-24 01:19:41] |     INFO | [analyze_class_imbalance.py:  78] | Class imbalance analysis | The Class Distribution plot is saved in reports\class_imbalance_figures


## 📈 Feature Correlations

In [9]:
analyze_correlations(data)

[2025-02-24 01:20:53] |     INFO | [analyze_correlations.py:  64] | Correlation analysis | The Feature Correlation plot is saved in reports\correlations_figures
[2025-02-24 01:20:54] |     INFO | [analyze_correlations.py:  83] | Correlation analysis | The target correlations plot is saved in reports\correlations_figures


| | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| gender | 1.0 | -0.030656 | 0.014203 | 0.077696 | 0.05433 | -0.022994 | 0.019957 | 0.017199 | 0.037411 |
| age | -0.030656 | 1.0 | 0.251171 | 0.233354 | 0.143647 | 0.337396 | 0.101354 | 0.110672 | 0.258008 |
| hypertension | 0.014203 | 0.251171 | 1.0 | 0.121262 | 0.031913 | 0.147666 | 0.080939 | 0.084429 | 0.197823 |
| heart_disease | 0.077696 | 0.233354 | 0.121262 | 1.0 | 0.071547 | 0.061198 | 0.067589 | 0.070066 | 0.171727 |
| smoking_history | 0.05433 | 0.143647 | 0.031913 | 0.071547 | 1.0 | 0.068321 | 0.023195 | 0.023031 | 0.057908 |
| bmi | -0.022994 | 0.337396 | 0.147666 | 0.061198 | 0.068321 | 1.0 | 0.082997 | 0.091261 | 0.214357 |
| HbA1c_level | 0.019957 | 0.101354 | 0.080939 | 0.067589 | 0.023195 | 0.082997 | 1.0 | 0.166733 | 0.40066 |
| blood_glucose_level | 0.017199 | 0.110672 | 0.084429 | 0.070066 | 0.023031 | 0.091261 | 0.166733 | 1.0 | 0.419558 |
| diabetes | 0.037411 | 0.258008 | 0.197823 | 0.171727 | 0.057908 | 0.214357 | 0.40066 | 0.419558 | 1.0 |
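
The matrix includes `gender` and `smoking_history`, which are string columns, so they must have been converted to numbers before the correlation was computed. The encoding used inside `analyze_correlations` is not shown here. The sketch below assumes integer category codes, which is only one possibility, and ranks features by absolute Pearson correlation with the target. `target_correlations` is a hypothetical helper name.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str) -> pd.Series:
    """Pearson correlation of every column with `target`, strongest first."""
    encoded = df.copy()
    for col in encoded.select_dtypes(exclude="number").columns:
        # Assumption: each category is mapped to an integer code (sorted order)
        encoded[col] = encoded[col].astype("category").cat.codes
    corr = encoded.corr()[target].drop(target)
    # Sort by magnitude so strong negative correlations are not pushed to the end
    return corr.sort_values(key=lambda s: s.abs(), ascending=False)


# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Female", "Male"],
    "HbA1c_level": [5.0, 5.7, 6.6, 4.8, 8.2, 9.0],
    "diabetes": [0, 0, 0, 0, 1, 1],
})
ranked = target_correlations(toy, target="diabetes")
print(ranked)
```

Integer codes impose an arbitrary order on nominal categories, so coefficients for `gender` and `smoking_history` obtained this way should be read with caution.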


## 🔑 Feature Importance Analysis

In [10]:
analyze_feature_importance(data)

[2025-02-24 01:22:09] |     INFO | [analyze_feature_importance.py:  74] | Feature_importance | Successfully saving the feature importance plot in reports\feature_importance_figures


{'HbA1c_level': 0.3365363646399916,
 'blood_glucose_level': 0.2693696958076413,
 'age': 0.19299985733952751,
 'bmi': 0.13130055603219237,
 'smoking_history': 0.025885870068770415,
 'hypertension': 0.023133620713561794,
 'heart_disease': 0.013866784027626991,
 'gender': 0.006907251370688071}

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h1 style="color: #FFFFFF; text-align: center;">📊 Diabetes Analysis Report</h1>
    <p style="color: #CCCCCC; text-align: center;">Comprehensive Analysis of Diabetes Classification Dataset</p>
</div>

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">📈 Dataset Overview</h2>
    <ul style="color: #CCCCCC;">
        <li>Total samples: 100,000 records</li>
        <li>Features: 8 predictor variables covering demographics, health metrics, and medical history, plus the binary target</li>
        <li>Target: Binary classification (No Diabetes: 91.5%, Diabetes: 8.5%)</li>
    </ul>
</div>

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">🔑 Key Feature Importance</h2>
    <p style="color: #CCCCCC;">Top predictors by importance:</p>
    <ol style="color: #CCCCCC;">
        <li><strong>HbA1c Level (33.7%)</strong>: Strongest predictor of diabetes</li>
        <li><strong>Blood Glucose Level (26.9%)</strong>: Second most important indicator</li>
        <li><strong>Age (19.3%)</strong>: Significant demographic factor</li>
        <li><strong>BMI (13.1%)</strong>: Important health metric</li>
        <li><strong>Other factors</strong>: Smoking history (2.6%), hypertension (2.3%), heart disease (1.4%), and gender (0.7%) each contribute less than 3%</li>
    </ol>
</div>
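
The percentages in the list above are the raw importance scores rescaled to percent. A short sketch of that arithmetic, using the scores printed by `analyze_feature_importance(data)` earlier in the notebook; how the module derives the scores is not shown, and the only assumption here is that they are meant to be read as shares of a total.

```python
# Importance scores copied from the analyze_feature_importance output
importances = {
    "HbA1c_level": 0.3365363646399916,
    "blood_glucose_level": 0.2693696958076413,
    "age": 0.19299985733952751,
    "bmi": 0.13130055603219237,
    "smoking_history": 0.025885870068770415,
    "hypertension": 0.023133620713561794,
    "heart_disease": 0.013866784027626991,
    "gender": 0.006907251370688071,
}

total = sum(importances.values())

# Each feature's share of the total importance, as a percentage
shares = {name: round(100 * score / total, 1) for name, score in importances.items()}

# Combined share of the four highest-scoring features
top_four = sum(sorted(importances.values(), reverse=True)[:4]) / total

for name, pct in shares.items():
    print(f"{name:<20} {pct:5.1f}%")
print(f"Top four features: {100 * top_four:.1f}% of total importance")
```

The four leading features account for roughly 93% of the total, which is why the remaining four are grouped together.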

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">👥 Demographic Distribution</h2>
    <ul style="color: #CCCCCC;">
        <li><strong>Gender</strong>: 
            <ul>
                <li>Female: 58,552 (58.6%)</li>
                <li>Male: 41,430 (41.4%)</li>
                <li>Other: 18 (0.02%)</li>
            </ul>
        </li>
        <li><strong>Age Distribution</strong>: 
            <ul>
                <li>Relatively uniform distribution across age groups</li>
                <li>Slight increase in middle-age categories</li>
                <li>Notable peak around age 80</li>
            </ul>
        </li>
    </ul>
</div>

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">🏥 Health Metrics Analysis</h2>
    <ul style="color: #CCCCCC;">
        <li><strong>BMI Distribution</strong>:
            <ul>
                <li>Median BMI: ~27</li>
                <li>Notable right skew with outliers above 60</li>
                <li>Most values concentrated between 20-40</li>
            </ul>
        </li>
        <li><strong>Blood Glucose Levels</strong>:
            <ul>
                <li>Moderate positive correlation with diabetes (0.42), the highest of any feature</li>
                <li>Multimodal distribution with peaks around normal and elevated ranges</li>
            </ul>
        </li>
        <li><strong>HbA1c Levels</strong>:
            <ul>
                <li>Moderate positive correlation with diabetes (0.40), second only to blood glucose</li>
                <li>Clear separation between normal and elevated ranges</li>
            </ul>
        </li>
    </ul>
</div>
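
The BMI outlier observation above comes from the distribution plot. For a numeric cutoff, one common convention is Tukey's 1.5 × IQR rule, sketched below with the BMI quartiles from the `data.describe()` output. This rule is a swapped-in technique for illustration, not the method used in the analysis modules, and `iqr_fences` is a hypothetical helper name.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) Tukey fences for the given quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# BMI quartiles taken from the data.describe() output earlier in the notebook
lower, upper = iqr_fences(q1=23.63, q3=29.58)
print(f"BMI fences: lower={lower:.3f}, upper={upper:.3f}")
```

With these quartiles the upper fence falls near 38.5, far below the "above 60" values visible in the plot, so this rule would flag many more records than the visual impression suggests. Which cutoff is appropriate depends on whether high BMI values are errors or genuine cases.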

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">🚬 Lifestyle Factors</h2>
    <p style="color: #CCCCCC;">Smoking history distribution:</p>
    <ul style="color: #CCCCCC;">
        <li>Never smoked: 35,095 (35.1%)</li>
        <li>No information: 35,816 (35.8%)</li>
        <li>Current smoker: 9,286 (9.3%)</li>
        <li>Former smoker: 9,352 (9.4%)</li>
        <li>Ever smoked: 4,004 (4.0%)</li>
        <li>Not currently smoking: 6,447 (6.4%)</li>
    </ul>
</div>

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">⚕️ Medical History</h2>
    <ul style="color: #CCCCCC;">
        <li><strong>Hypertension</strong>:
            <ul>
                <li>Present in 7,485 patients (7.5%)</li>
                <li>Weak positive correlation with diabetes (0.20)</li>
            </ul>
        </li>
        <li><strong>Heart Disease</strong>:
            <ul>
                <li>Present in 3,942 patients (3.9%)</li>
                <li>Weak positive correlation with diabetes (0.17)</li>
            </ul>
        </li>
    </ul>
</div>

<div style="background-color: #30302E; padding: 20px; border-radius: 10px; margin-bottom: 20px;">
    <h2 style="color: #FFFFFF;">💡 Key Insights & Recommendations</h2>
    <ol style="color: #CCCCCC;">
        <li><strong>Class Imbalance</strong>: The significant class imbalance (91.5% vs 8.5%) will require special handling during model development, such as:
            <ul>
                <li>SMOTE or other oversampling techniques</li>
                <li>Class weights in model training</li>
                <li>Ensemble methods</li>
            </ul>
        </li>
        <li><strong>Feature Engineering</strong>: Consider creating:
            <ul>
                <li>BMI categories based on standard ranges</li>
                <li>Age group categories</li>
                <li>Interaction terms between correlated features</li>
            </ul>
        </li>
        <li><strong>Model Selection</strong>: Prioritize:
            <ul>
                <li>Models that handle imbalanced data well</li>
                <li>Algorithms that can capture non-linear relationships</li>
                <li>Techniques that provide feature importance rankings</li>
            </ul>
        </li>
    </ol>
</div>
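
Two of the recommendations above, BMI categories and class weights, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the BMI cut-offs follow the commonly used 18.5 / 25 / 30 thresholds, the weights use the `n_samples / (n_classes * class_count)` heuristic, and `add_bmi_category` and `balanced_class_weights` are hypothetical helper names rather than project code.

```python
import pandas as pd

# Assumption: standard BMI thresholds; right=False makes each bin [low, high)
BMI_BINS = [0, 18.5, 25, 30, float("inf")]
BMI_LABELS = ["underweight", "normal", "overweight", "obese"]


def add_bmi_category(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `df` with a categorical `bmi_category` column."""
    out = df.copy()
    out["bmi_category"] = pd.cut(
        out["bmi"], bins=BMI_BINS, labels=BMI_LABELS, right=False
    )
    return out


def balanced_class_weights(y: pd.Series) -> dict:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = y.value_counts()
    return {cls: len(y) / (len(counts) * n) for cls, n in counts.items()}


# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "bmi": [17.0, 25.0, 27.32, 30.0, 45.5],
    "diabetes": [0, 0, 0, 0, 1],
})
toy = add_bmi_category(toy)
weights = balanced_class_weights(toy["diabetes"])
print(toy)
print(weights)
```

Applied to the full dataset's 91,500 / 8,500 split, the same formula gives weights of about 0.55 for the majority class and 5.88 for the minority class. A dictionary of this shape can typically be passed to classifiers that accept per-class weights.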