# SDG 3: Good Health and Well-being – Diabetes Prediction using Supervised Learning
### Author: Student Name
### Dataset: Pima Indians Diabetes Database (Kaggle)

This project supports **SDG 3 (Good Health and Well-being)** by using machine learning to predict diabetes likelihood based on patient health metrics. Early prediction can help in prevention and better healthcare planning.

In [8]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import warnings
warnings.filterwarnings('ignore')

ModuleNotFoundError: No module named 'sklearn'

In [None]:
# Load dataset
data = pd.read_csv('diabetes.csv')
data.head()

In [None]:
# Basic info
data.info()
data.describe()

In [None]:
# Check for missing values
print('Missing values per column:')
print(data.isnull().sum())

In [None]:
# Feature correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Feature Correlation Heatmap')
plt.show()

In [None]:
# Split data into features and target
X = data.drop('Outcome', axis=1)
y = data['Outcome']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print('Training set:', X_train.shape, 'Test set:', X_test.shape)

In [None]:
# Train Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

In [None]:
# Evaluate model
print('Accuracy:', accuracy_score(y_test, y_pred))
print('
Confusion Matrix:
', confusion_matrix(y_test, y_pred))
print('
Classification Report:
', classification_report(y_test, y_pred))

In [None]:
# Plot confusion matrix
plt.figure(figsize=(5,4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()

In [None]:
# Feature importance
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
sns.barplot(x=importances, y=importances.index, palette='viridis')
plt.title('Feature Importance')
plt.xlabel('Importance Score')
plt.ylabel('Features')
plt.show()

## Ethical Reflection
- The dataset may not represent all demographics equally (gender, age, ethnicity), which can introduce bias.
- Predictions should be used to **support** healthcare decisions, not replace professionals.
- All patient data should remain **anonymized** to protect privacy.

## Conclusion
The Random Forest model achieved good accuracy in predicting diabetes likelihood. Glucose level, BMI, and age were among the most significant predictors.
This aligns with **SDG 3**, supporting early detection and promoting good health and well-being.