# Diabetes Prediction Using Machine Learning

## IIT Guwahati - Data Science Project

**Project Overview:**  
This project aims to predict diabetes risk using machine learning algorithms on the Pima Indians Diabetes dataset. We employ Logistic Regression and Linear Discriminant Analysis (LDA) to build predictive models and evaluate their performance.

**Dataset:** Pima Indians Diabetes Dataset (768 samples, 8 features)

**Algorithms:** Logistic Regression, Linear Discriminant Analysis (LDA)

**Objective:** Develop and evaluate classification models to predict diabetes based on patient health metrics.


## 1. Introduction

### 1.1 Problem Statement

Diabetes is a chronic metabolic disorder affecting millions worldwide. Early detection and risk assessment are crucial for effective prevention and management. This project develops machine learning models to predict diabetes risk based on clinical measurements.

### 1.2 Objectives

1. Perform exploratory data analysis to understand the dataset
2. Preprocess data to handle missing values and ensure data quality
3. Train and evaluate classification models (Logistic Regression and LDA)
4. Interpret model coefficients to understand feature importance
5. Assess model performance using multiple evaluation metrics

### 1.3 Dataset Description

The Pima Indians Diabetes dataset contains medical measurements from 768 female patients of Pima Indian heritage. The dataset includes 8 input features and 1 target variable (Outcome: 0 = No Diabetes, 1 = Diabetes).


In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("Libraries imported successfully!")


## 2. Dataset Loading

We begin by loading the diabetes dataset and examining its basic structure.


In [None]:
# Load the dataset
df = pd.read_csv('data/diabetes.csv')

print("=" * 60)
print("DATASET OVERVIEW")
print("=" * 60)
print(f"Dataset Shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")
print()
print("Column Names:")
print(df.columns.tolist())


In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()


In [None]:
# Dataset information
print("Dataset Information:")
df.info()


In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()


In [None]:
# Check for missing values
print("Missing Values:")
missing_values = df.isnull().sum()
print(missing_values)
print(f"\nTotal missing values: {missing_values.sum()}")

# Check for zero values (which may represent missing data in medical context)
print("\nZero values in each column:")
zero_counts = {}
for col in df.columns:
    if col != 'Outcome':
        zero_count = (df[col] == 0).sum()
        zero_counts[col] = zero_count
        print(f"{col:25s}: {zero_count:4d} zeros")


In [None]:
# Target variable distribution
print("Target Variable Distribution (Outcome):")
print(df['Outcome'].value_counts())
print()
print("Percentage distribution:")
print(df['Outcome'].value_counts(normalize=True) * 100)

# Visualize target distribution
plt.figure(figsize=(8, 5))
df['Outcome'].value_counts().plot(kind='bar', color=['#3498db', '#e74c3c'])
plt.title('Distribution of Diabetes Outcomes', fontsize=14, fontweight='bold')
plt.xlabel('Outcome (0 = No Diabetes, 1 = Diabetes)', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()


## 3. Exploratory Data Analysis (EDA)

We perform comprehensive exploratory data analysis to understand feature distributions, correlations, and relationships with the target variable.


In [None]:
# Feature distributions by outcome
fig, axes = plt.subplots(2, 4, figsize=(20, 10))
axes = axes.ravel()

feature_columns = df.columns[:-1]  # All columns except 'Outcome'

for idx, col in enumerate(feature_columns):
    # Box plot
    df.boxplot(column=col, by='Outcome', ax=axes[idx], grid=False)
    axes[idx].set_title(f'{col}', fontsize=11, fontweight='bold')
    axes[idx].set_xlabel('Outcome', fontsize=10)
    axes[idx].set_ylabel(col, fontsize=10)

plt.suptitle('Feature Distributions by Diabetes Outcome', fontsize=16, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()


In [None]:
# Correlation matrix
correlation_matrix = df.corr()

plt.figure(figsize=(12, 10))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
sns.heatmap(correlation_matrix, 
            annot=True, 
            fmt='.2f', 
            cmap='coolwarm', 
            center=0,
            square=True, 
            linewidths=0.5,
            cbar_kws={"shrink": 0.8},
            mask=mask)
plt.title('Feature Correlation Matrix (Lower Triangle)', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()


In [None]:
# Correlation with Outcome (target variable)
print("Correlation with Outcome (Target Variable):")
print("-" * 50)
outcome_corr = correlation_matrix['Outcome'].sort_values(ascending=False)
for feature, corr in outcome_corr.items():
    if feature != 'Outcome':
        print(f"{feature:30s}: {corr:6.3f}")

# Visualize correlation with outcome
plt.figure(figsize=(10, 6))
outcome_corr_sorted = outcome_corr.drop('Outcome').sort_values(ascending=True)
colors = ['#e74c3c' if x > 0 else '#3498db' for x in outcome_corr_sorted.values]
plt.barh(range(len(outcome_corr_sorted)), outcome_corr_sorted.values, color=colors)
plt.yticks(range(len(outcome_corr_sorted)), outcome_corr_sorted.index)
plt.xlabel('Correlation Coefficient', fontsize=12)
plt.title('Feature Correlation with Diabetes Outcome', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()


### 3.1 Key EDA Insights

**Key Findings:**
- **Class Distribution**: 65.1% No Diabetes, 34.9% Diabetes (moderately imbalanced)
- **Top Correlated Features with Outcome**:
  1. Glucose (0.467) - Strongest predictor
  2. BMI (0.293) - Moderate predictor
  3. Age (0.238) - Moderate predictor
  4. Pregnancies (0.222) - Moderate predictor
- **Data Quality**: No explicit missing values, but zero values in several features likely represent missing data
- **Feature Ranges**: Features have vastly different scales (e.g., Insulin: 0-846 vs DiabetesPedigreeFunction: 0.08-2.42)
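
The class imbalance noted above is why the Logistic Regression model later in this notebook is trained with `class_weight='balanced'`. As a minimal sketch of what that option does, the snippet below computes the balanced weights for an assumed 500/268 class split (the counts implied by 65.1% / 34.9% of 768 rows). The array `y_demo` is a stand-in, not the notebook's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed class counts: 500 negatives and 268 positives, i.e. the
# 65.1% / 34.9% split of 768 rows reported above.
y_demo = np.array([0] * 500 + [1] * 268)

# 'balanced' weight for class c = n_samples / (n_classes * count_c)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
manual = len(y_demo) / (2 * np.bincount(y_demo))

for label, w in zip([0, 1], weights):
    print(f"class {label}: weight {w:.3f}")
```

The minority class receives the larger weight, so misclassified diabetes cases cost more during training.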


## 4. Data Preprocessing

Medical datasets often contain zero values that represent missing data rather than actual zero measurements. We identify and handle these invalid zeros by replacing them with NaN and imputing using median values.


In [None]:
# Apply the same preprocessing steps as preprocessing.py, inlined here
# (nothing is imported from that module)

# Define columns where zero values are medically invalid
invalid_zero_cols = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Create a copy for preprocessing
df_cleaned = df.copy()

# Count zeros before processing
print("Zero values before preprocessing:")
print("-" * 50)
zero_counts_before = {}
for col in invalid_zero_cols:
    zero_count = (df_cleaned[col] == 0).sum()
    zero_counts_before[col] = zero_count
    print(f"{col:25s}: {zero_count:4d} zeros")

print(f"\nTotal invalid zeros: {sum(zero_counts_before.values())}")


In [None]:
# Replace invalid zeros with NaN
for col in invalid_zero_cols:
    df_cleaned[col] = df_cleaned[col].replace(0, np.nan)

# Calculate median values for imputation
median_values = {}
for col in invalid_zero_cols:
    median_val = df_cleaned[col].median()
    median_values[col] = median_val
    print(f"{col:25s}: Median = {median_val:8.2f}")

# Impute missing values with median
for col in invalid_zero_cols:
    missing_before = df_cleaned[col].isna().sum()
    df_cleaned[col] = df_cleaned[col].fillna(median_values[col])
    print(f"{col:25s}: Imputed {missing_before:4d} missing values")

print("\n✓ Preprocessing complete!")


In [None]:
# Verify preprocessing
print("Verification after preprocessing:")
print("-" * 50)
remaining_zeros = {}
for col in invalid_zero_cols:
    zeros_remaining = (df_cleaned[col] == 0).sum()
    remaining_zeros[col] = zeros_remaining
    if zeros_remaining == 0:
        print(f"{col:25s}: ✓ No zeros remaining")
    else:
        print(f"{col:25s}: ⚠️  {zeros_remaining} zeros still present")

remaining_nans = df_cleaned[invalid_zero_cols].isna().sum()
if remaining_nans.sum() == 0:
    print("\n✓ No NaN values remaining - all imputed successfully")
else:
    print(f"\n⚠️  Remaining NaN values: {remaining_nans.sum()}")

# Save cleaned dataset
df_cleaned.to_csv('data/diabetes_cleaned.csv', index=False)
print("\n✓ Cleaned dataset saved to 'data/diabetes_cleaned.csv'")


### 4.1 Preprocessing Summary

**Actions Performed:**
- Identified invalid zero values in 5 medical features (Glucose, BloodPressure, SkinThickness, Insulin, BMI)
- Replaced 652 invalid zeros with NaN
- Imputed missing values using median (robust to outliers)
- Verified all zeros and NaN values were handled

**Why Median Imputation?**
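- The median is robust to outliers, unlike the mean
- It is a better measure of the typical value for skewed measurements such as Insulin
- Caveat: Insulin (374 zeros) and SkinThickness (227 zeros) have so many imputed values that filling them with a single constant creates a spike at the median and shrinks their variance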
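
The medians above are computed on the full dataset before the train/test split, so test rows influence the imputed values. One possible alternative, sketched below on synthetic data, is to place the imputer inside a scikit-learn `Pipeline` so that the medians are learned from the training rows only. The data and the names `X_demo`, `y_demo`, and `pipe` are hypothetical and not part of this project's code.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical): 200 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=100.0, scale=15.0, size=(200, 3))
y_demo = (X_demo[:, 1] + rng.normal(scale=10.0, size=200) > 100.0).astype(int)

# Mark roughly 20% of column 0 as missing
X_demo[rng.random(200) < 0.2, 0] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Imputation and scaling are fitted on the training rows only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)

train_median = np.nanmedian(X_tr[:, 0])
learned_median = pipe.named_steps["impute"].statistics_[0]
print(f"Training-only median used for imputation: {learned_median:.2f}")
print(f"Test accuracy: {pipe.score(X_te, y_te):.3f}")
```

With this layout the test set is transformed using statistics it did not contribute to, which keeps the evaluation honest.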


## 5. Model Training

We train two classification models: Logistic Regression and Linear Discriminant Analysis (LDA). Features are standardized first. This matters for the L2-regularized Logistic Regression and puts the coefficients of both models on a common scale.


In [None]:
# Import machine learning libraries
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Prepare features and target
X = df_cleaned.drop('Outcome', axis=1)
y = df_cleaned['Outcome']

print("Dataset prepared for modeling:")
print(f"Features (X): {X.shape}")
print(f"Target (y): {y.shape}")
print(f"Feature names: {list(X.columns)}")


In [None]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y  # Maintains class distribution
)

print("Train-Test Split:")
print(f"Training set: {X_train.shape[0]} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"Test set: {X_test.shape[0]} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print()
print("Training set class distribution:")
print(y_train.value_counts())
print("Percentages:")
print((y_train.value_counts(normalize=True) * 100).round(1))


In [None]:
# Feature Scaling (Standardization)
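# Fitted on the training set only; matters for regularized Logistic Regression
# and makes the coefficients of both models comparable across features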

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Convert back to DataFrame for better readability
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X.columns, index=X_train.index)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X.columns, index=X_test.index)

print("Feature scaling applied:")
print("✓ Training data scaled (mean≈0, std≈1)")
print("✓ Test data scaled using training statistics")
print()
print("Sample scaled values (first 3 rows):")
X_train_scaled.head(3)


### 5.1 Why Feature Scaling is Necessary

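**For Logistic Regression:**
- scikit-learn applies L2 regularization by default. The penalty acts on raw coefficient size, so features measured on small scales (which need large coefficients) are penalized more heavily than features on large scales
- The `lbfgs` solver used here is a quasi-Newton method, not plain gradient descent, but it still converges faster when features have similar scales
- Standardized coefficients can be compared across features

**For LDA:**
- LDA assumes that all classes share one covariance matrix; it does not assume that features have similar variances
- Without shrinkage, LDA predictions are unaffected by rescaling individual features, so scaling is not needed for accuracy here
- Scaling still puts the LDA coefficients on a common scale, so they can be compared with each other and with the Logistic Regression coefficients

**Solution:** StandardScaler transforms each feature to mean 0 and standard deviation 1, using statistics from the training set only.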
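
The standardization step amounts to `z = (x - mean) / std`, with the mean and standard deviation taken from the training set and then reused for the test set. A minimal sketch on hypothetical data (two columns with Insulin-like and pedigree-like ranges, not the real features) checks this against `StandardScaler`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Hypothetical features on very different scales
X_tr = np.column_stack([rng.uniform(15, 850, 100), rng.uniform(0.08, 2.4, 100)])
X_te = np.column_stack([rng.uniform(15, 850, 20), rng.uniform(0.08, 2.4, 20)])

scaler = StandardScaler().fit(X_tr)

# Manual standardization using training statistics only
mu = X_tr.mean(axis=0)
sigma = X_tr.std(axis=0)  # population std (ddof=0), as StandardScaler uses
X_te_manual = (X_te - mu) / sigma

print("Manual and StandardScaler results match:",
      np.allclose(X_te_manual, scaler.transform(X_te)))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, because it is transformed with the training statistics.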


In [None]:
# Train Logistic Regression Model
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced',  # Handle class imbalance
    solver='lbfgs'
)

print("Training Logistic Regression...")
lr_model.fit(X_train_scaled, y_train)
print("✓ Logistic Regression trained successfully")

# Display coefficients
print("\nLogistic Regression Coefficients:")
print("-" * 50)
lr_coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr_model.coef_[0],
    'Abs_Coefficient': np.abs(lr_model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

for _, row in lr_coefficients.iterrows():
    direction = "↑ Increases" if row['Coefficient'] > 0 else "↓ Decreases"
    print(f"{row['Feature']:25s}: {row['Coefficient']:8.4f} ({direction} diabetes risk)")

print(f"\nIntercept: {lr_model.intercept_[0]:.4f}")


In [None]:
# Train LDA Model
lda_model = LinearDiscriminantAnalysis(
    solver='svd',
    shrinkage=None
)

print("Training LDA...")
lda_model.fit(X_train_scaled, y_train)
print("✓ LDA trained successfully")

# Display coefficients
print("\nLDA Coefficients:")
print("-" * 50)
lda_coefficients = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lda_model.coef_[0],
    'Abs_Coefficient': np.abs(lda_model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

for _, row in lda_coefficients.iterrows():
    direction = "↑ Increases" if row['Coefficient'] > 0 else "↓ Decreases"
    print(f"{row['Feature']:25s}: {row['Coefficient']:8.4f} ({direction} diabetes risk)")

print(f"\nIntercept: {lda_model.intercept_[0]:.4f}")


## 6. Model Evaluation

We evaluate both models using multiple metrics: accuracy, precision, recall, F1-score, and ROC-AUC. We also generate confusion matrices and ROC curves for comprehensive assessment.
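
Of these metrics, ROC-AUC is the least intuitive. It equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The sketch below checks this on hypothetical scores (synthetic, not model output from this notebook) by comparing a pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y_true = rng.integers(0, 2, size=200)

# Hypothetical scores: positives tend to score higher than negatives
scores = rng.normal(loc=y_true * 1.0, scale=1.0)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive with every negative; ties count as half
diff = pos[:, None] - neg[None, :]
pairwise_auc = (diff > 0).mean() + 0.5 * (diff == 0).mean()

print(f"pairwise: {pairwise_auc:.4f}, sklearn: {roc_auc_score(y_true, scores):.4f}")
```

Because ROC-AUC depends only on the ranking of scores, two models can share the same AUC and still differ in precision and recall at a fixed decision threshold.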


In [None]:
# Import evaluation metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, roc_curve
)

# Make predictions
lr_pred = lr_model.predict(X_test_scaled)
lr_pred_proba = lr_model.predict_proba(X_test_scaled)[:, 1]

lda_pred = lda_model.predict(X_test_scaled)
lda_pred_proba = lda_model.predict_proba(X_test_scaled)[:, 1]

print("Predictions generated for both models")


In [None]:
# Calculate metrics for Logistic Regression
lr_accuracy = accuracy_score(y_test, lr_pred)
lr_precision = precision_score(y_test, lr_pred)
lr_recall = recall_score(y_test, lr_pred)
lr_f1 = f1_score(y_test, lr_pred)
lr_auc = roc_auc_score(y_test, lr_pred_proba)

# Calculate metrics for LDA
lda_accuracy = accuracy_score(y_test, lda_pred)
lda_precision = precision_score(y_test, lda_pred)
lda_recall = recall_score(y_test, lda_pred)
lda_f1 = f1_score(y_test, lda_pred)
lda_auc = roc_auc_score(y_test, lda_pred_proba)

# Create comparison DataFrame
comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Logistic Regression': [lr_accuracy, lr_precision, lr_recall, lr_f1, lr_auc],
    'LDA': [lda_accuracy, lda_precision, lda_recall, lda_f1, lda_auc]
})

print("Model Performance Comparison:")
print("=" * 60)
print(comparison_df.to_string(index=False))


In [None]:
# Confusion Matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Logistic Regression Confusion Matrix
lr_cm = confusion_matrix(y_test, lr_pred)
sns.heatmap(lr_cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['No Diabetes', 'Diabetes'],
            yticklabels=['No Diabetes', 'Diabetes'])
axes[0].set_title('Logistic Regression\nConfusion Matrix', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Actual', fontsize=11)
axes[0].set_xlabel('Predicted', fontsize=11)

# LDA Confusion Matrix
lda_cm = confusion_matrix(y_test, lda_pred)
sns.heatmap(lda_cm, annot=True, fmt='d', cmap='Oranges', ax=axes[1],
            xticklabels=['No Diabetes', 'Diabetes'],
            yticklabels=['No Diabetes', 'Diabetes'])
axes[1].set_title('LDA\nConfusion Matrix', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Actual', fontsize=11)
axes[1].set_xlabel('Predicted', fontsize=11)

plt.tight_layout()
plt.show()

# Print confusion matrix details
print("\nLogistic Regression Confusion Matrix:")
tn, fp, fn, tp = lr_cm.ravel()
print(f"True Negatives (TN):  {tn:3d}")
print(f"False Positives (FP): {fp:3d}")
print(f"False Negatives (FN): {fn:3d}")
print(f"True Positives (TP):  {tp:3d}")

print("\nLDA Confusion Matrix:")
tn, fp, fn, tp = lda_cm.ravel()
print(f"True Negatives (TN):  {tn:3d}")
print(f"False Positives (FP): {fp:3d}")
print(f"False Negatives (FN): {fn:3d}")
print(f"True Positives (TP):  {tp:3d}")


In [None]:
# ROC Curves
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_pred_proba)
lda_fpr, lda_tpr, _ = roc_curve(y_test, lda_pred_proba)

plt.figure(figsize=(10, 8))
plt.plot(lr_fpr, lr_tpr, label=f'Logistic Regression (AUC = {lr_auc:.3f})', 
         linewidth=2, color='#3498db')
plt.plot(lda_fpr, lda_tpr, label=f'LDA (AUC = {lda_auc:.3f})', 
         linewidth=2, color='#e74c3c')
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.500)', linewidth=1)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate (1 - Specificity)', fontsize=12)
plt.ylabel('True Positive Rate (Sensitivity/Recall)', fontsize=12)
plt.title('ROC Curves: Model Comparison', fontsize=14, fontweight='bold', pad=20)
plt.legend(loc="lower right", fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


### 6.1 Evaluation Summary

**Logistic Regression Performance:**
- Accuracy: ~73.4%
- Precision: ~60.3%
- Recall: ~70.4%
- F1-Score: ~65.0%
- ROC-AUC: ~81.3%

**LDA Performance:**
- Accuracy: ~70.1%
- Precision: ~59.1%
- Recall: ~48.1%
- F1-Score: ~53.1%
- ROC-AUC: ~81.3%

**Key Observations:**
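- Both models achieve nearly identical ROC-AUC (~0.81), indicating similar ability to rank patients by risk
- Logistic Regression has much higher recall (70.4% vs 48.1%), so it catches more diabetes cases
- LDA makes fewer positive predictions and therefore fewer false positives, but its precision (59.1%) is slightly lower than that of Logistic Regression (60.3%)
- Part of the gap comes from the setup rather than the algorithms: Logistic Regression was trained with `class_weight='balanced'`, while LDA uses the empirical class priors, so at the default 0.5 threshold LDA predicts the positive class less often
- The choice between models depends on whether we prioritize sensitivity (recall) or specificity (the true negative rate); precision is a separate quantity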
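
Precision, recall, and specificity are all ratios of confusion-matrix counts, but they use different denominators. The sketch below uses hypothetical counts (not taken from this notebook's output) to show that fewer false positives raise specificity without necessarily raising precision. The helper `rates` is illustrative only.

```python
def rates(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Return common classification rates from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),      # of predicted positives, share correct
        "recall": tp / (tp + fn),         # of actual positives, share found
        "specificity": tn / (tn + fp),    # of actual negatives, share found
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts for two models on the same 150 test cases
model_a = rates(tn=75, fp=25, fn=15, tp=35)
model_b = rates(tn=85, fp=15, fn=30, tp=20)

for name, r in [("A", model_a), ("B", model_b)]:
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Model B has fewer false positives and higher specificity, yet its precision is lower because it also finds far fewer true positives.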


## 7. Model Interpretation

We analyze the Logistic Regression coefficients to understand feature importance and their medical significance.


In [None]:
# Extract and display coefficients sorted by absolute value
coefficients_df = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lr_model.coef_[0],
    'Abs_Coefficient': np.abs(lr_model.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

print("Logistic Regression Coefficients (Sorted by Absolute Value):")
print("=" * 70)
print(coefficients_df.to_string(index=False))


In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))
coefficients_sorted = coefficients_df.sort_values('Coefficient', ascending=True)
colors = ['#e74c3c' if x > 0 else '#3498db' for x in coefficients_sorted['Coefficient'].values]
plt.barh(range(len(coefficients_sorted)), coefficients_sorted['Coefficient'].values, color=colors)
plt.yticks(range(len(coefficients_sorted)), coefficients_sorted['Feature'].values)
plt.xlabel('Coefficient Value', fontsize=12)
plt.title('Logistic Regression Feature Coefficients', fontsize=14, fontweight='bold')
plt.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()


### 7.1 Feature Importance Analysis

**Top 4 Most Important Features (by absolute coefficient):**

1. **Glucose (1.18)** - Most important predictor
   - Direct indicator of diabetes (blood sugar level)
   - Strongest correlation with outcome (0.467)
   - Clinical significance: Primary diagnostic marker

2. **BMI (0.71)** - Second most important
   - Body Mass Index reflects obesity
   - Moderate correlation with outcome (0.293)
   - Clinical significance: Obesity is a major risk factor for Type 2 diabetes

3. **Pregnancies (0.37)** - Third most important
   - Number of pregnancies
   - Moderate correlation with outcome (0.222)
   - Clinical significance: may act as a proxy for gestational diabetes history, which increases future risk but is not recorded directly in this dataset

4. **DiabetesPedigreeFunction (0.29)** - Fourth most important
   - Genetic predisposition indicator
   - Weak correlation with outcome (0.174)
   - Clinical significance: Family history is a non-modifiable risk factor

**Interpretation:**
- Positive coefficients increase diabetes risk
- Negative coefficients decrease diabetes risk
- Larger absolute values indicate stronger influence
- Since features are standardized, coefficients are directly comparable
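
A logistic regression coefficient is the change in log-odds per one-unit increase in a feature. With standardized features, one unit is one standard deviation, so `exp(coefficient)` is the multiplicative change in the odds of diabetes per one-standard-deviation increase, holding the other features fixed. The sketch below converts the four coefficients listed above into odds ratios. It assumes all four are positive, as their interpretation in the text implies.

```python
import math

# Coefficients as reported above (standardized features, assumed positive)
coefs = {
    "Glucose": 1.18,
    "BMI": 0.71,
    "Pregnancies": 0.37,
    "DiabetesPedigreeFunction": 0.29,
}

# Odds ratio per +1 standard deviation of each feature
odds_ratios = {name: math.exp(beta) for name, beta in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name:25s}: odds multiplied by about {ratio:.2f} per +1 SD")
```

An odds ratio is not a probability ratio: the effect on predicted probability depends on where the patient already sits on the logistic curve.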


## 8. Conclusion

### 8.1 Project Summary

This project successfully developed and evaluated machine learning models for diabetes risk prediction using the Pima Indians Diabetes dataset. Key achievements:

1. **Data Quality**: Identified and handled 652 invalid zero values representing missing data
2. **Model Performance**: Achieved 81.3% AUC-ROC with both Logistic Regression and LDA
3. **Feature Insights**: Identified Glucose and BMI as the most important predictors
4. **Clinical Relevance**: Model coefficients align with established medical knowledge

### 8.2 Key Findings

- **Best Performing Model**: Logistic Regression (73.4% accuracy, 81.3% AUC-ROC)
- **Most Important Feature**: Glucose (coefficient: 1.18)
- **Data Quality Issue**: Zero values in medical features represent missing data, not actual zeros
- **Class Imbalance**: Dataset is moderately imbalanced (65% no diabetes, 35% diabetes)

### 8.3 Limitations

1. **Dataset Size**: 768 samples may limit model generalization
2. **Population Specificity**: Pima Indian population may not generalize to other populations
3. **Feature Engineering**: Limited to original features; could benefit from interaction terms
4. **Model Complexity**: Linear models may miss non-linear relationships

### 8.4 Future Work

1. **Advanced Models**: Explore Random Forest, XGBoost, or Neural Networks
2. **Feature Engineering**: Create interaction features (e.g., Glucose × BMI)
3. **Hyperparameter Tuning**: Optimize model parameters using grid search
4. **External Validation**: Test on independent datasets
5. **Clinical Integration**: Develop user-friendly interface for healthcare providers
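
Items 2 and 3 could be combined in a single cross-validated search. The following is a rough sketch on synthetic data, not a result from this project: `make_classification` stands in for the dataset, and the grid over `C` is an arbitrary illustrative choice.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for an 8-feature, moderately imbalanced dataset (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    weights=[0.65, 0.35], random_state=42,
)

pipe = Pipeline([
    # Adds pairwise products such as feature_i * feature_j
    ("interactions", PolynomialFeatures(degree=2, interaction_only=True,
                                        include_bias=False)),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=2000, class_weight="balanced")),
])

# Search over the inverse regularization strength C
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(pipe, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_demo, y_demo)

print("Best C:", search.best_params_["clf__C"])
print(f"Mean cross-validated ROC-AUC: {search.best_score_:.3f}")
```

Keeping the interaction and scaling steps inside the pipeline means they are refitted within each fold, so the cross-validated score is not inflated by information from the validation rows.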

### 8.5 Final Remarks

Both models show reasonable discriminative performance (ROC-AUC of about 0.81 on a single held-out test set) and provide interpretable insights into diabetes risk factors. The relative sizes of the coefficients are consistent with clinical knowledge, which supports the plausibility of the models but does not replace external validation. Models of this kind could support early risk screening as part of preventive healthcare.

---

**Project completed for IIT Guwahati Data Science Course**
