## 1. Problem Statement

Diabetes is a chronic disease that often progresses silently for years before clinical diagnosis.
Delayed identification significantly increases the risk of severe complications such as cardiovascular disease, kidney failure, and neuropathy, leading to higher mortality rates and escalating healthcare costs.

Healthcare systems face a critical operational challenge: how to proactively identify individuals at high risk of diabetes within large populations using limited clinical resources.

Traditional screening strategies are often reactive and resource-intensive, relying on laboratory tests applied uniformly across populations. Because uniform testing does not distinguish high-risk from low-risk individuals, it makes early intervention slow and inefficient when clinical resources must be prioritized.

This project addresses the need for a data-driven screening mechanism that supports early risk identification and enables healthcare providers to prioritize patients who require immediate clinical evaluation.

## 2. Objective

The objective of this project is to develop a binary classification model that functions as an early-stage screening tool for diabetes risk.

The model is not intended to provide a clinical diagnosis. Instead, it aims to identify individuals who should be prioritized for further medical testing, such as blood glucose or HbA1c exams.

Given the clinical context, the model is optimized to minimize false negatives, as failing to identify a high-risk individual may lead to delayed treatment and severe long-term complications.

Therefore, Recall is selected as the primary evaluation metric, reflecting the system's role as a high-sensitivity triage mechanism rather than a precision-focused diagnostic model.
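To make the metric concrete, the sketch below computes Recall as TP / (TP + FN) for the positive (high-risk) class. The helper name `recall_score_binary` and the toy labels are assumptions made for illustration; they are not taken from the project's code or data.

```python
def recall_score_binary(y_true, y_pred):
    """Recall = TP / (TP + FN) for the positive class (label 1)."""
    # True positives: high-risk individuals the model flagged
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    # False negatives: high-risk individuals the model missed
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp + fn == 0:
        raise ValueError("y_true contains no positive cases; recall is undefined")
    return tp / (tp + fn)


# Toy labels, invented for illustration only (1 = high risk, 0 = low risk)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]

recall = recall_score_binary(y_true, y_pred)
print(f"recall = {recall:.2f}")
```

In this toy case one of the four high-risk individuals is missed, so Recall is 3/4. The false positive in `y_pred` does not affect Recall, which is why a Recall-focused screen accepts some unnecessary follow-up tests in exchange for fewer missed cases.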

## 4. Methodology

The project follows a structured data science pipeline designed to support clinical risk triage decisions.

1. Data Understanding and Quality Assessment
   Initial analysis focused on understanding the dataset structure, target imbalance, and potential data integrity issues such as duplicated records.

2. Exploratory Data Analysis (EDA)
   Exploratory analysis was conducted to identify relationships between health indicators and diabetes prevalence, with particular attention to physiological risk factors such as BMI, blood pressure, and cholesterol.

3. Feature Engineering
   Domain-informed transformations were applied to improve signal quality, including capping extreme BMI values and deriving composite health indicators to capture sustained poor health conditions.

4. Model Development
   Multiple tree-based models were evaluated, with a focus on techniques robust to class imbalance. Gradient boosting methods were prioritized because each boosting round concentrates on the cases that earlier rounds predicted poorly, which under class imbalance tend to include many minority-class cases.

5. Evaluation and Threshold Optimization
   Model performance was assessed using clinically motivated metrics, with classification thresholds adjusted to prioritize Recall and minimize false negatives in a triage context.

6. Interpretability and Validation
   Model interpretability techniques were applied to ensure predictions aligned with known clinical risk factors, supporting trust and transparency in healthcare decision-making.
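The threshold adjustment in step 5 can be sketched as follows: given predicted probabilities, pick the highest cutoff that still reaches a target Recall, so that precision is sacrificed no more than necessary. This is a minimal illustration, not the project's implementation; the function names `recall_at` and `pick_threshold`, the target value, and the toy scores are all assumptions.

```python
def recall_at(y_true, y_score, threshold):
    """Recall of the positive class when predicting 1 for score >= threshold."""
    positives = sum(1 for t in y_true if t == 1)
    if positives == 0:
        raise ValueError("y_true contains no positive cases; recall is undefined")
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= threshold)
    return tp / positives


def pick_threshold(y_true, y_score, target_recall):
    """Return the highest observed score usable as a cutoff that meets target_recall."""
    # Recall can only rise as the cutoff drops, so scan from the highest score down
    candidates = sorted(set(y_score), reverse=True)
    for thr in candidates:
        if recall_at(y_true, y_score, thr) >= target_recall:
            return thr
    # Only reached if target_recall > 1.0; the lowest cutoff flags everyone
    return candidates[-1]


# Toy validation labels and predicted probabilities, invented for illustration
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_score = [0.05, 0.20, 0.35, 0.40, 0.55, 0.70, 0.80, 0.90]

threshold = pick_threshold(y_true, y_score, target_recall=0.75)
print(f"chosen threshold = {threshold}")
```

In practice the cutoff would be chosen on a held-out validation split rather than the test set, so that the reported Recall is not optimistically biased.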

In [2]:
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)

In [3]:
from src.preprocessing import basic_cleaning, engineer_features, prepare_train_test_split