# Diabetes Risk Prediction for SDG 3: Good Health and Well-Being

This notebook trains a logistic regression model to predict diabetes risk from the Pima Indians Diabetes dataset. The project supports **SDG 3 – Good Health and Well-Being** by demonstrating how machine learning can support early detection and preventive healthcare.

In [None]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

## 1. Load the Dataset

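Make sure you have downloaded `diabetes.csv` (Pima Indians Diabetes dataset) and placed it in the `data/` folder one level above this notebook. The loading cell reads `../data/diabetes.csv` relative to the notebook's working directory.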

In [None]:
csv_path = os.path.join('..', 'data', 'diabetes.csv')

if not os.path.exists(csv_path):
    raise FileNotFoundError(
        f'Data file not found at {csv_path}. Please download diabetes.csv and place it in the data folder.'
    )

data = pd.read_csv(csv_path)
data.head()

## 2. Explore the Data

Check the basic structure and distribution of the target variable (`Outcome`).

In [None]:
data.info()

In [None]:
data.describe()
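
One thing worth checking in the summary statistics is the minimum of each column. In the commonly distributed version of this CSV, zeros in measurements such as glucose or BMI are physiologically implausible and are usually read as missing values. The sketch below counts zeros per column. The column names are an assumption based on the widely used file layout, and `count_zeros` and the toy frame are hypothetical helpers for illustration only.

```python
import pandas as pd

# Assumed column names; adjust them if your copy of diabetes.csv differs.
ZERO_AS_MISSING = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']


def count_zeros(df, columns=ZERO_AS_MISSING):
    """Return the number of zero entries per column, skipping absent columns."""
    present = [c for c in columns if c in df.columns]
    return (df[present] == 0).sum()


# Toy frame standing in for `data` so the sketch runs on its own.
toy = pd.DataFrame({
    'Glucose': [148, 0, 183],
    'BMI': [33.6, 26.6, 0.0],
    'Age': [50, 31, 32],
})
zero_counts = count_zeros(toy)
print(zero_counts)
```

In the notebook this would be called as `count_zeros(data)`. How to handle any zeros it reports (leave as-is, impute, or drop) is a modeling choice this notebook does not make.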

In [None]:
sns.countplot(x='Outcome', data=data)
plt.title('Distribution of Diabetes Outcome (0 = No, 1 = Yes)')
plt.show()
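
The count plot shows the class balance visually. A numeric version is useful later, because the share of the majority class is the accuracy a model would reach by always predicting that class. The following is a minimal sketch with a hypothetical helper, `majority_baseline`, run on a toy series rather than the real data.

```python
import pandas as pd


def majority_baseline(y):
    """Return (class proportions, accuracy of always predicting the majority class)."""
    proportions = y.value_counts(normalize=True).sort_index()
    return proportions, proportions.max()


# Toy target standing in for data['Outcome'].
toy_outcome = pd.Series([0, 0, 0, 1], name='Outcome')
proportions, baseline_acc = majority_baseline(toy_outcome)
print(proportions)
print('Majority-class baseline accuracy:', baseline_acc)
```

In the notebook, `majority_baseline(data['Outcome'])` gives the figure to compare against the accuracy reported in the evaluation section.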

In [None]:
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=False, cmap='viridis')
plt.title('Correlation Heatmap')
plt.show()

## 3. Prepare Data for Modeling

Split the data into features (`X`) and target (`y`), then into training and test sets.

In [None]:
X = data.drop('Outcome', axis=1)
y = data['Outcome']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

X_train.shape, X_test.shape
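
Passing `stratify=y` asks `train_test_split` to keep the proportion of each class roughly equal in the training and test sets. The sketch below illustrates this on synthetic data. The toy features, the 35/65 class split, and the variable names are assumptions made for the example, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 35 positives and 65 negatives.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
y_toy = pd.Series([1] * 35 + [0] * 65)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, the positive rate should match in both splits.
print('Train positive rate:', y_tr.mean())
print('Test positive rate:', y_te.mean())
```

The same check on the notebook's own split is `y_train.mean()` versus `y_test.mean()`.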

## 4. Feature Scaling

Standardize the features so that all variables are on a similar scale.

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
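
The scaler is fitted on the training set only and then applied to the test set, so no information about the test data leaks into preprocessing. A small sketch on synthetic data shows what that means in practice: the scaled training columns have mean 0 and standard deviation 1, while the test columns are transformed with the training statistics and only land near those values. All names and numbers here are illustrative assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for X_train and X_test.
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=100, scale=20, size=(80, 2))
X_te = rng.normal(loc=100, scale=20, size=(20, 2))

scaler_demo = StandardScaler()
X_tr_s = scaler_demo.fit_transform(X_tr)  # learns mean and std from training rows
X_te_s = scaler_demo.transform(X_te)      # reuses the training mean and std

train_means = X_tr_s.mean(axis=0)
train_stds = X_tr_s.std(axis=0)
print('Scaled train mean:', train_means)
print('Scaled train std:', train_stds)
print('Scaled test mean:', X_te_s.mean(axis=0))
```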

## 5. Train Logistic Regression Model

Fit a logistic regression classifier on the standardized training data. Setting `max_iter=1000` raises the iteration cap above scikit-learn's default of 100 so the solver has room to converge.


In [None]:
model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

## 6. Evaluate the Model

Evaluate predictions on the held-out test set using accuracy, a classification report, and a confusion matrix.

In [None]:
y_pred = model.predict(X_test_scaled)

acc = accuracy_score(y_test, y_pred)
print('Accuracy:', acc)

print('Classification Report:\n', classification_report(y_test, y_pred))

cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.title('Confusion Matrix')
plt.show()
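
For a screening task, two numbers from the confusion matrix are often easier to discuss than overall accuracy: sensitivity (the share of true diabetes cases the model catches) and specificity (the share of non-cases it correctly clears). The sketch below derives both from `confusion_matrix`. The helper name and the toy labels are hypothetical.

```python
from sklearn.metrics import confusion_matrix


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels 0 and 1."""
    # With labels=[0, 1], ravel() yields the cells in the order tn, fp, fn, tp.
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # recall for the positive class
    specificity = tn / (tn + fp)  # recall for the negative class
    return sensitivity, specificity


# Toy labels: 4 true positives cases, 6 true negatives cases.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true_toy, y_pred_toy)
print('Sensitivity:', sens)
print('Specificity:', spec)
```

In the notebook this would be `sensitivity_specificity(y_test, y_pred)`. The helper assumes both classes are present in `y_true`; otherwise a denominator is zero.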

## 7. Feature Importance (Coefficients)

Because the features were standardized before training, the coefficient magnitudes are roughly comparable across features. A larger absolute value indicates a stronger association with the predicted outcome, and the sign gives the direction.


In [None]:
feature_names = X.columns
coefficients = model.coef_[0]

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values(by='coefficient', key=np.abs, ascending=False)

coef_df

In [None]:
plt.figure(figsize=(8, 5))
sns.barplot(data=coef_df, x='coefficient', y='feature')
plt.title('Feature Importance (Logistic Regression Coefficients)')
plt.tight_layout()
plt.show()
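
Logistic regression coefficients are on the log-odds scale, which can be hard to explain in a report. Exponentiating a coefficient gives an odds ratio; because the features were standardized, it is the multiplicative change in the odds of a positive prediction for a one standard deviation increase in that feature, with the others held fixed. The sketch below uses a hypothetical helper and made-up coefficient values purely to show the conversion.

```python
import numpy as np
import pandas as pd


def odds_ratios(feature_names, coefficients):
    """Return a table of coefficients and their odds ratios, largest first."""
    table = pd.DataFrame({
        'feature': list(feature_names),
        'coefficient': coefficients,
        'odds_ratio': np.exp(coefficients),
    })
    return table.sort_values(by='odds_ratio', ascending=False).reset_index(drop=True)


# Made-up coefficients for illustration; these are not results from the model.
toy = odds_ratios(
    ['Glucose', 'Age', 'BloodPressure'],
    np.array([np.log(2.0), 0.0, -np.log(2.0)]),
)
print(toy)
```

In the notebook the call would be `odds_ratios(X.columns, model.coef_[0])`. Note that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the coefficients are somewhat shrunk toward zero.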

## 8. Reflection (for your report)

- How well did the model perform?
- Which features seem most important for predicting diabetes?
- What are the risks of false negatives and false positives?
- How can this kind of model support **SDG 3 – Good Health and Well-Being**?

Use this section to write notes that you will reuse in your 1-page article and presentation.
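
For the question on false negatives and false positives, it can help to see how the two error counts move as the decision threshold changes. `model.predict` uses a cutoff of 0.5 on the predicted probability; lowering it catches more true cases at the cost of more false alarms. The following is a minimal sketch with a hypothetical helper and toy probabilities, not output from the trained model.

```python
import numpy as np


def errors_at_thresholds(y_true, proba, thresholds=(0.3, 0.5, 0.7)):
    """Return a list of (threshold, false_negatives, false_positives)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    rows = []
    for t in thresholds:
        pred = (proba >= t).astype(int)
        fn = int(((pred == 0) & (y_true == 1)).sum())  # missed cases
        fp = int(((pred == 1) & (y_true == 0)).sum())  # false alarms
        rows.append((t, fn, fp))
    return rows


# Toy labels and predicted probabilities of the positive class.
y_true_toy = [1, 1, 1, 0, 0, 0]
proba_toy = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]
rows = errors_at_thresholds(y_true_toy, proba_toy)
for t, fn, fp in rows:
    print(f'threshold={t}: false negatives={fn}, false positives={fp}')
```

With the notebook's objects, the call would be `errors_at_thresholds(y_test, model.predict_proba(X_test_scaled)[:, 1])`.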