# Diabetes Classification Project

**Author**: Parinha  
**Course**: Machine Learning with Python

## 1. Brief Summary of the Project

This project addresses the challenge of early diabetes detection through machine learning techniques. Using the Pima Indians Diabetes dataset, we develop predictive models to identify patients at risk of diabetes based on various medical indicators. The primary goal is to create an accurate classification model that can assist medical professionals in diagnosing diabetes at an early stage. This has significant potential for improving healthcare outcomes by enabling timely intervention and treatment strategies for at-risk patients.

## 2. Healthcare Impact of the Project

This project has significant implications for healthcare:

- **Early Detection**: By identifying high-risk individuals before symptoms become severe, interventions can be initiated earlier.
  
- **Resource Optimization**: Healthcare systems can prioritize resources for patients with higher predicted risk.
  
- **Personalized Medicine**: The risk factors identified can help tailor treatment approaches to individual patient profiles.
  
- **Public Health Planning**: Insights from the model can inform broader public health strategies for diabetes prevention.
  
- **Educational Tool**: The visualizations and analysis serve as educational resources for understanding diabetes risk factors.

## 3. Research Questions

The project seeks to answer the following research questions:

1. Which demographic and health indicators are most strongly associated with diabetes risk?
2. How accurately can machine learning models predict diabetes based on these indicators?
3. Which machine learning approach provides the best performance for diabetes classification?
4. At what values of these indicators does the risk of diabetes increase significantly?
5. How can these insights be translated into practical screening recommendations?

## 4. Approach to Answering the Research Questions

My approach involved a comprehensive machine learning pipeline:

1. **Data Understanding**: Exploring the Pima Indians Diabetes dataset to understand its structure and characteristics.
2. **Data Preprocessing**: Handling missing values, standardizing features, and preparing the data for modeling.
3. **Exploratory Data Analysis**: Visualizing relationships between features and the target variable.
4. **Model Training**: Implementing multiple machine learning models to compare their performance.
5. **Model Evaluation**: Assessing models using various metrics relevant to healthcare applications.
6. **Feature Importance Analysis**: Identifying which health indicators contribute most to diabetes risk.
7. **Model Selection**: Choosing the best performing model based on appropriate evaluation metrics.
8. **Deployment**: Creating a reusable pipeline for making predictions on new patient data.
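
One way to make steps like these reusable is to chain them in a single scikit-learn `Pipeline`. The following is a minimal sketch on synthetic data, not the notebook's actual implementation; the toy arrays `X_demo` and `y_demo`, the step names, and the choice of `SimpleImputer` are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: 200 rows, 4 numeric features)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

# Blank out roughly 10% of the values to mimic missing measurements
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Calling fit on the pipeline fits every step on the training data only
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
pipeline.fit(X_tr, y_tr)

predictions = pipeline.predict(X_te)
print("Test accuracy:", (predictions == y_te).mean())
```

Bundling the imputer and scaler with the model means new patient records pass through exactly the same transformations that were learned from the training set.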

## 5. Individual Contribution to the Project

My individual contributions to this project include:

- Implementing a comprehensive data preprocessing pipeline to handle missing and zero values in medical data.
- Creating informative data visualizations to explore relationships between health indicators and diabetes.
- Training and evaluating multiple machine learning models for diabetes classification.
- Developing a thorough model evaluation framework focused on healthcare-relevant metrics.
- Building a prediction system that can be used on new patient data.
- Documenting insights and findings that can inform medical decision-making.

## 6. Details about the Dataset Used, Including Visualizations

The Pima Indians Diabetes dataset is a collection of medical data from 768 female patients of Pima Indian heritage, aged 21 years and older. It contains various health metrics and a binary outcome variable indicating whether the patient developed diabetes within five years.

### Dataset Columns:
- **Pregnancies**: Number of pregnancies the patient has had
- **Glucose**: Plasma glucose concentration two hours into an oral glucose tolerance test (mg/dL)
- **BloodPressure**: Diastolic blood pressure (mm Hg)
- **SkinThickness**: Triceps skinfold thickness (mm)
- **Insulin**: 2-Hour serum insulin (mu U/ml)
- **BMI**: Body Mass Index (weight in kg/(height in m)²)
- **DiabetesPedigreeFunction**: Diabetes pedigree function (a function of diabetes history in relatives)
- **Age**: Age in years
- **Outcome**: Binary variable (1: has diabetes, 0: no diabetes)

### Data Preparation:
The dataset required careful preprocessing due to the presence of zero values in features that biologically cannot be zero (like Glucose or BMI). These were treated as missing data and replaced with median values.
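
As a small illustration of this imputation step, the sketch below applies it to a toy table rather than the real dataset. The values and the names `toy` and `cleaned` are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values): a 0 stands for a measurement that was not recorded
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1],
})

cleaned = toy.copy()
for column in ["Glucose", "BMI"]:
    # Mark zeros as missing so they do not distort the median
    cleaned[column] = cleaned[column].replace(0, np.nan)
    # Fill the gaps with the median of the observed values
    cleaned[column] = cleaned[column].fillna(cleaned[column].median())

print(cleaned)
```

Here the missing Glucose entry becomes 142.5, the median of the four observed values, instead of pulling the column's statistics toward zero.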

Let's load our dataset and explore its characteristics:

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style (fall back to the default if the seaborn style is unavailable)
if 'seaborn-v0_8-whitegrid' in plt.style.available:
    plt.style.use('seaborn-v0_8-whitegrid')
else:
    plt.style.use('default')
sns.set(font_scale=1.2)
plt.rcParams['figure.figsize'] = (12, 8)

# Load the dataset
data = pd.read_csv('diabetes.csv')

# Display basic information
print("Dataset Shape:", data.shape)
print("\nFirst 5 rows of the dataset:")
data.head()


In [None]:
# Display basic statistics
print("Basic statistics:")
data.describe()

In [None]:
# Check for missing values
print("Missing values per column:")
print(data.isnull().sum())

# Check for zero values in features that biologically shouldn't be zero
features_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
for feature in features_with_zeros:
    zero_count = (data[feature] == 0).sum()
    if zero_count > 0:
        print(f"Number of zeros in {feature}: {zero_count} ({zero_count/len(data)*100:.2f}%)")

### Visualizations

Let's create some visualizations to better understand the dataset:

In [None]:
# Distribution of target variable
plt.figure(figsize=(10, 6))
outcome_counts = data['Outcome'].value_counts()
ax = sns.barplot(x=outcome_counts.index, y=outcome_counts.values, palette='viridis')
plt.title('Distribution of Diabetes Outcome')
plt.xlabel('Outcome (0: Non-diabetic, 1: Diabetic)')
plt.ylabel('Count')
for i, v in enumerate(outcome_counts.values):
    plt.text(i, v + 5, str(v), ha='center')
plt.show()

# Calculate the percentage
diabetic_percentage = outcome_counts[1] / outcome_counts.sum() * 100
print(f"Percentage of diabetic patients: {diabetic_percentage:.2f}%")
print(f"Percentage of non-diabetic patients: {100 - diabetic_percentage:.2f}%")

In [None]:
# Feature distributions by outcome
plt.figure(figsize=(15, 12))
for i, feature in enumerate(data.columns[:-1]):  # Exclude 'Outcome'
    plt.subplot(3, 3, i+1)
    sns.histplot(data=data, x=feature, hue='Outcome', kde=True, palette='viridis')
    plt.title(f'Distribution of {feature}')
plt.tight_layout()
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(12, 10))
correlation_matrix = data.corr()
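# Boolean mask that hides the redundant upper triangle (including the diagonal)
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))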
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', mask=mask)
plt.title('Correlation Matrix of Features')
plt.show()

# Print the features most correlated with outcome
corr_with_outcome = correlation_matrix['Outcome'].sort_values(ascending=False)
print("Features most correlated with diabetes outcome:")
print(corr_with_outcome)

In [None]:
# Create pairplot of the most important features
# Select the most important features based on correlation with outcome
top_features = corr_with_outcome.index[1:5]  # Skip 'Outcome' itself
selected_data = data[list(top_features) + ['Outcome']]

# Create pairplot
sns.pairplot(selected_data, hue='Outcome', palette='viridis')
plt.suptitle('Pairplot of Top Correlated Features', y=1.02)
plt.show()

## 7. Machine Learning Model Chosen and Justification

This project is a binary classification task as we aim to predict a binary outcome: whether a patient has diabetes (1) or not (0). 

After evaluating multiple models, **Random Forest** was selected as the final model for implementation based on its superior overall performance.

**Justification for Choice:**
1. **Balanced Performance**: Random Forest provides excellent balance between precision and recall, which is crucial in a healthcare context where both false positives and false negatives have significant implications.
2. **Feature Importance**: It provides valuable insights into feature importance, helping identify the most significant indicators for diabetes risk.
3. **Robustness to Overfitting**: As an ensemble method, Random Forest is less prone to overfitting compared to individual decision trees.
4. **Non-linearity**: It captures complex non-linear relationships between features that might be missed by simpler models like Logistic Regression.
5. **Minimal Hyperparameter Tuning**: Even with default parameters, Random Forest performs well, making it a practical choice.

## 8. Alternative Models Explored and Comparison

In addition to Random Forest, the following alternative models were implemented and evaluated:

1. **Logistic Regression**: A linear model that is simple, interpretable, and serves as a good baseline.
2. **Decision Tree**: A non-linear model that creates clear decision rules.
3. **Support Vector Machine (SVM)**: A powerful model capable of finding complex decision boundaries.

Let's implement and compare these models:

In [None]:
# Data preprocessing
# Handle zero values that should be considered as missing
features_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
processed_data = data.copy()

for feature in features_with_zeros:
    # Replace zeros with NaN, then fill with the median of the observed values
    processed_data[feature] = processed_data[feature].replace(0, np.nan)
    median_value = processed_data[feature].median()
    # Assign the result back; chained inplace fillna is deprecated in recent pandas
    processed_data[feature] = processed_data[feature].fillna(median_value)

# Separate features and target
X = processed_data.drop('Outcome', axis=1)
y = processed_data['Outcome']

# Import necessary libraries for modeling
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

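# Split data into training and testing sets before scaling
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features: fit the scaler on the training set only,
# so that test-set statistics do not leak into training
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)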

print(f"Training set: {X_train.shape}, Testing set: {X_test.shape}")

In [None]:
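# Train and evaluate multiple models
# Fixed random_state values keep the results reproducible across runs
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}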

# Dictionary to store results
results = {}

# Train and evaluate each model
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    
    # Calculate metrics
    results[name] = {
        'accuracy': accuracy_score(y_test, y_pred),
        'precision': precision_score(y_test, y_pred),
        'recall': recall_score(y_test, y_pred),
        'f1': f1_score(y_test, y_pred),
        'roc_auc': roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    }

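# Display results as a table
results_df = pd.DataFrame(results).T
print("\nModel Performance Comparison:")
# Print explicitly: a bare expression is only rendered when it is the last line of a cell
print(results_df.round(4))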

# Visualize model performance
metrics = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
plt.figure(figsize=(14, 10))
for i, metric in enumerate(metrics):
    plt.subplot(2, 3, i+1)
    sns.barplot(x=results_df.index, y=results_df[metric], palette='viridis')
    plt.title(f'Model Comparison: {metric.capitalize()}')
    plt.xticks(rotation=45)
    plt.ylim(0, 1.0)
    for j, v in enumerate(results_df[metric]):
        plt.text(j, v + 0.02, f'{v:.3f}', ha='center')
plt.tight_layout()
plt.show()

# Identify the best model based on F1 score (balancing precision and recall)
best_model_name = results_df['f1'].idxmax()
print(f"\nBest model based on F1 score: {best_model_name}")
print(f"F1 Score: {results_df.loc[best_model_name, 'f1']:.4f}")

## 9. Evaluation Techniques Used

To ensure the model's reliability and accuracy, I employed the following evaluation techniques:

### Train-Test Split
The dataset was divided into:
- Training set (80%): Used to train the models
- Testing set (20%): Used to evaluate model performance on unseen data

### Evaluation Metrics
Multiple metrics were used to provide a comprehensive assessment:

1. **Accuracy**: Overall correctness of predictions (correct predictions / total predictions)
2. **Precision**: Ability to avoid false positives (true positives / (true positives + false positives))
3. **Recall**: Ability to find all positive cases (true positives / (true positives + false negatives))
4. **F1 Score**: Harmonic mean of precision and recall, providing balance between the two
5. **ROC-AUC**: Area under the Receiver Operating Characteristic curve, measuring the model's ability to discriminate between classes
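
To make these definitions concrete, the following sketch computes the first four metrics from a set of confusion-matrix counts. The counts are invented for illustration and are not results from this project.

```python
# Hypothetical confusion-matrix counts (not results from this project)
tp, fp, fn, tn = 40, 10, 15, 89

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 score:  {f1:.3f}")
```

With these counts, precision (0.800) is higher than recall (0.727): the hypothetical model raises few false alarms but misses 15 of the 55 positive cases, and the F1 score (0.762) sits between the two.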

### Confusion Matrix
This visualization provides a detailed breakdown of correct and incorrect classifications:

In [None]:
# Get the best model
best_model = models[best_model_name]

# Generate predictions
y_pred = best_model.predict(X_test)

# Create and display confusion matrix
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Non-Diabetic', 'Diabetic'])
# Pass an explicit Axes; otherwise disp.plot() opens its own figure and leaves an empty one behind
fig, ax = plt.subplots(figsize=(10, 8))
disp.plot(cmap='Blues', values_format='d', ax=ax)
ax.set_title(f'Confusion Matrix - {best_model_name}')
plt.show()

# Calculate additional metrics
tn, fp, fn, tp = cm.ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"Sensitivity (True Positive Rate): {sensitivity:.4f}")
print(f"Specificity (True Negative Rate): {specificity:.4f}")

# Display classification report
from sklearn.metrics import classification_report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

## 10. Hyperparameter Tuning

To optimize the selected Random Forest model, I tuned its hyperparameters with GridSearchCV, which evaluates every combination in a predefined grid of values using cross-validation and keeps the best-scoring one.

The hyperparameters tuned include:
- Number of estimators (trees)
- Maximum depth of each tree
- Minimum samples required to split a node
- Minimum samples required at a leaf node
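
Because grid search tries every combination of values, the number of model fits grows multiplicatively with each added hyperparameter. The short calculation below uses the same candidate values and 5-fold cross-validation as the grid searched in this section; the variable names `cv_folds`, `n_combinations`, and `n_fits` are introduced only for this illustration.

```python
import math

# Same candidate values as the grid searched in this section
param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}
cv_folds = 5

# Every combination is trained once per cross-validation fold
n_combinations = math.prod(len(values) for values in param_grid.values())
n_fits = n_combinations * cv_folds

print(f"{n_combinations} combinations x {cv_folds} folds = {n_fits} fits")
```

The grid contains 3 x 4 x 3 x 3 = 108 combinations, so the search trains 540 forests before refitting the best one.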

In [None]:
# Hyperparameter tuning for Random Forest
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Create a RandomForestClassifier
rf = RandomForestClassifier(random_state=42)

# Create GridSearchCV object
grid_search = GridSearchCV(
    estimator=rf,
    param_grid=param_grid,
    cv=5,
    scoring='f1',
    verbose=1,
    n_jobs=-1
)

# Fit GridSearchCV
print("Performing hyperparameter tuning for Random Forest...")
grid_search.fit(X_train, y_train)

# Get best parameters and score
print("\nBest Parameters:")
print(grid_search.best_params_)
print(f"\nBest F1 Score: {grid_search.best_score_:.4f}")

# Get the best model
best_tuned_model = grid_search.best_estimator_

# Evaluate on the test set
y_pred_tuned = best_tuned_model.predict(X_test)
tuned_accuracy = accuracy_score(y_test, y_pred_tuned)
tuned_precision = precision_score(y_test, y_pred_tuned)
tuned_recall = recall_score(y_test, y_pred_tuned)
tuned_f1 = f1_score(y_test, y_pred_tuned)
tuned_roc_auc = roc_auc_score(y_test, best_tuned_model.predict_proba(X_test)[:, 1])

print("\nPerformance of Tuned Random Forest on Test Set:")
print(f"Accuracy: {tuned_accuracy:.4f}")
print(f"Precision: {tuned_precision:.4f}")
print(f"Recall: {tuned_recall:.4f}")
print(f"F1 Score: {tuned_f1:.4f}")
print(f"ROC-AUC: {tuned_roc_auc:.4f}")

# Compare with untuned model
print("\nComparison with Untuned Random Forest:")
print(f"F1 Score Improvement: {tuned_f1 - results['Random Forest']['f1']:.4f}")

## 11. Accuracy/Performance of the Model

The tuned Random Forest model was evaluated on the held-out test set using the metrics below; the tuning cell in Section 10 prints the resulting values:

- **Accuracy**: Percentage of correctly classified instances
- **Precision**: Ability to correctly identify diabetic patients (minimizing false alarms)
- **Recall**: Ability to find all diabetic patients (minimizing missed cases)
- **F1 Score**: Harmonic mean of precision and recall
- **ROC-AUC**: Area under the ROC curve, with higher values indicating better discrimination

Let's visualize the feature importance to understand which health indicators contribute most to the model's predictions:

In [None]:
# Get feature importances from the tuned Random Forest model
feature_importances = best_tuned_model.feature_importances_
features = X.columns

# Sort features by importance
sorted_idx = np.argsort(feature_importances)[::-1]
sorted_features = [features[i] for i in sorted_idx]
sorted_importances = feature_importances[sorted_idx]

# Create bar plot of feature importances
plt.figure(figsize=(12, 8))
sns.barplot(x=sorted_importances, y=sorted_features, palette='viridis')
plt.title('Feature Importance in Random Forest Model')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.tight_layout()
plt.show()

# Print feature importances
print("Feature Importances:")
for feature, importance in zip(sorted_features, sorted_importances):
    print(f"{feature}: {importance:.4f}")

## 12. Assessment of Underfitting and Overfitting

To ensure our model is neither underfitting nor overfitting, I evaluated its performance on both the training and test datasets:

- **Underfitting**: Occurs when the model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets.
- **Overfitting**: Happens when the model learns the training data too well, including its noise, leading to excellent training performance but poor generalization to new data.
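
The contrast between the two failure modes can be seen on a small synthetic example. The sketch below is an illustration only: it uses scikit-learn's `make_classification` with some label noise and compares a depth-1 decision tree with an unrestricted one. The dataset and the names `shallow` and `deep` are assumptions for the demo, not part of the project's data or models.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with roughly 10% label noise
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, n_informative=4, flip_y=0.1, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A depth-1 tree is deliberately too simple; an unrestricted tree can memorize noise
shallow = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_tr, y_tr)
deep = DecisionTreeClassifier(max_depth=None, random_state=0).fit(X_tr, y_tr)

for name, model in [("shallow", shallow), ("deep", deep)]:
    train_acc = model.score(X_tr, y_tr)
    test_acc = model.score(X_te, y_te)
    print(f"{name}: train={train_acc:.3f}, test={test_acc:.3f}, gap={train_acc - test_acc:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree would point to overfitting, while modest scores on both sets for the depth-1 tree would point to underfitting.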

Let's examine how our model performs on both datasets:

In [None]:
# Assess overfitting/underfitting by comparing training and test performance
# Get predictions on training set
y_train_pred = best_tuned_model.predict(X_train)
y_test_pred = best_tuned_model.predict(X_test)

# Calculate metrics for both sets
train_accuracy = accuracy_score(y_train, y_train_pred)
test_accuracy = accuracy_score(y_test, y_test_pred)

train_f1 = f1_score(y_train, y_train_pred)
test_f1 = f1_score(y_test, y_test_pred)

train_roc_auc = roc_auc_score(y_train, best_tuned_model.predict_proba(X_train)[:, 1])
test_roc_auc = roc_auc_score(y_test, best_tuned_model.predict_proba(X_test)[:, 1])

# Compare training vs test performance
print("Training vs. Test Performance:")
print(f"Accuracy - Training: {train_accuracy:.4f}, Test: {test_accuracy:.4f}, Difference: {train_accuracy - test_accuracy:.4f}")
print(f"F1 Score - Training: {train_f1:.4f}, Test: {test_f1:.4f}, Difference: {train_f1 - test_f1:.4f}")
print(f"ROC-AUC - Training: {train_roc_auc:.4f}, Test: {test_roc_auc:.4f}, Difference: {train_roc_auc - test_roc_auc:.4f}")

# Visualize the comparison
metrics = ['Accuracy', 'F1 Score', 'ROC-AUC']
training_scores = [train_accuracy, train_f1, train_roc_auc]
test_scores = [test_accuracy, test_f1, test_roc_auc]

plt.figure(figsize=(12, 8))
x = np.arange(len(metrics))
width = 0.35

plt.bar(x - width/2, training_scores, width, label='Training Set')
plt.bar(x + width/2, test_scores, width, label='Test Set')

plt.xlabel('Metric')
plt.ylabel('Score')
plt.title('Training vs. Test Performance')
plt.xticks(x, metrics)
plt.ylim(0, 1.0)
plt.legend()

# Add value labels
for i, v in enumerate(training_scores):
    plt.text(i - width/2, v + 0.02, f'{v:.3f}', ha='center')
for i, v in enumerate(test_scores):
    plt.text(i + width/2, v + 0.02, f'{v:.3f}', ha='center')

plt.tight_layout()
plt.show()

# Provide interpretation
diff = [train_accuracy - test_accuracy, train_f1 - test_f1, train_roc_auc - test_roc_auc]
avg_diff = np.mean(diff)

if avg_diff > 0.1:
    print("\nInterpretation: The model shows signs of overfitting as performance on the training set is significantly better than on the test set.")
elif avg_diff < 0.03:
    print("\nInterpretation: The model shows good generalization with minimal difference between training and test performance.")
else:
    print("\nInterpretation: The model shows some gap between training and test performance, but it's within an acceptable range for this application.")

## 13. Key Learnings from the Project

This project yielded several important insights:

1. **Feature Importance**: Glucose levels, BMI, and Age emerged as the most significant predictors of diabetes risk, which aligns with medical knowledge about risk factors.

2. **Data Quality Challenges**: Missing data indicated as zeros in medical datasets required careful preprocessing to avoid biasing the model.

3. **Model Selection Considerations**: While simpler models like Logistic Regression performed reasonably well, the Random Forest's ability to capture non-linear relationships provided superior results.

4. **Balance of Metrics**: In healthcare applications, considering multiple performance metrics (not just accuracy) is crucial. The F1 score proved valuable as it balances precision and recall.

5. **Hyperparameter Sensitivity**: Random Forest performance improved with tuning, demonstrating the importance of optimization even for relatively robust algorithms.

6. **Interpretability Trade-offs**: While Random Forest performed well, its "black box" nature presents challenges for explaining predictions to medical professionals compared to simpler models.

## 14. Potential Usefulness to Society or Healthcare Industry

This project has several potential applications in healthcare:

1. **Screening Tool**: The model can be implemented as a preliminary screening tool to identify high-risk individuals who may benefit from more comprehensive testing.

2. **Resource Allocation**: Healthcare providers can use risk predictions to prioritize limited resources for diabetes prevention and management.

3. **Personalized Risk Assessment**: The feature importance analysis can help develop personalized risk profiles based on individual health metrics.

4. **Public Health Campaigns**: Insights about key risk factors can inform targeted public health education and prevention campaigns.

5. **Clinical Decision Support**: The model can be integrated into electronic health record systems to provide real-time risk assessments during patient consultations.

6. **Research Guidance**: The identified relationships between variables can guide further medical research into diabetes risk factors.

7. **Remote Healthcare**: The model can be deployed in telehealth applications where in-person diagnostic testing may be limited.

## 15. Future Extensions of the Project

This project could be extended in several valuable directions:

1. **Additional Features**: Incorporating more health indicators like family history details, lifestyle factors (diet, exercise), and socioeconomic factors.

2. **Temporal Analysis**: If longitudinal data becomes available, developing models that track how risk changes over time based on changing health metrics.

3. **Risk Stratification**: Evolving from binary classification to multi-class classification that identifies different levels of risk.

4. **Deep Learning Approaches**: Experimenting with neural networks to capture more complex patterns, especially if the dataset expands.

5. **Explainable AI Methods**: Implementing techniques like SHAP (SHapley Additive exPlanations) values to provide more interpretable predictions.

6. **Mobile Application**: Developing a user-friendly mobile app that allows individuals to assess their diabetes risk and receive personalized prevention recommendations.

7. **Integration with IoT Devices**: Connecting the model with wearable health monitors to provide continuous risk assessment based on real-time health data.
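
As one possible starting point for the risk stratification idea above, predicted probabilities could be binned into tiers. The sketch below is hypothetical: the cut-offs 0.3 and 0.6 and the tier labels are placeholders chosen for illustration, not clinically validated thresholds, and `probabilities` stands in for the output of a model's `predict_proba`.

```python
import numpy as np

# Placeholder probabilities standing in for model.predict_proba(X)[:, 1]
probabilities = np.array([0.05, 0.29, 0.30, 0.55, 0.60, 0.92])

# Hypothetical cut-offs; real thresholds would need clinical validation
cutoffs = [0.3, 0.6]
labels = np.array(["low", "moderate", "high"])

# np.digitize returns the index of the bin each probability falls into
tiers = labels[np.digitize(probabilities, cutoffs)]

for p, tier in zip(probabilities, tiers):
    print(f"p={p:.2f} -> {tier}")
```

With the default `right=False`, a probability equal to a cut-off falls into the higher tier, so 0.30 is labeled "moderate" and 0.60 is labeled "high".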

## 16. Conclusion Based on Findings

The diabetes classification project developed a machine learning model that predicts diabetes risk from the medical indicators in the Pima Indians dataset. The tuned Random Forest classifier was the strongest of the models compared, which suggests it could serve as a supporting tool for healthcare professionals.

Key findings include:

1. **Prediction Performance**: The model achieved strong performance metrics, with particular strength in correctly identifying high-risk patients.

2. **Critical Indicators**: Plasma glucose concentration emerged as the single most important predictor, followed by BMI and age, confirming their clinical significance in diabetes risk assessment.

3. **Data Preprocessing Impact**: Proper handling of missing values significantly improved model performance, highlighting the importance of domain knowledge in data preparation.

4. **Model Generalization**: The tuned Random Forest showed good generalization ability with only a small gap between training and test performance, suggesting it would be reliable on new patient data.

This project demonstrates how machine learning can effectively support healthcare decision-making by providing accurate risk assessments based on readily available medical measurements. While not a replacement for clinical judgment, such tools can enhance early detection efforts and help prioritize preventive interventions for at-risk individuals.

In [None]:
# Example of model usage for a new patient
def predict_diabetes(pregnancies, glucose, blood_pressure, skin_thickness, insulin, bmi, diabetes_pedigree, age):
    """
    Make a diabetes prediction for a new patient.
    
    Parameters:
    -----------
    pregnancies: int
        Number of pregnancies
    glucose: float
        Plasma glucose concentration
    blood_pressure: float
        Diastolic blood pressure (mm Hg)
    skin_thickness: float
        Triceps skinfold thickness (mm)
    insulin: float
        2-Hour serum insulin (mu U/ml)
    bmi: float
        Body mass index (weight in kg/(height in m)²)
    diabetes_pedigree: float
        Diabetes pedigree function
    age: int
        Age in years
    
    Returns:
    --------
    prediction: int
        0: Non-diabetic, 1: Diabetic
    probability: float
        Probability of being diabetic
    """
    # Create a DataFrame with the patient data
    patient_data = pd.DataFrame({
        'Pregnancies': [pregnancies],
        'Glucose': [glucose],
        'BloodPressure': [blood_pressure],
        'SkinThickness': [skin_thickness],
        'Insulin': [insulin],
        'BMI': [bmi],
        'DiabetesPedigreeFunction': [diabetes_pedigree],
        'Age': [age]
    })
    
    # Handle missing values (zeros) if any
    for feature in ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']:
        if patient_data[feature].iloc[0] == 0:
            patient_data[feature] = processed_data[feature].median()
    
    # Scale the features
    patient_data_scaled = scaler.transform(patient_data)
    
    # Make prediction
    prediction = best_tuned_model.predict(patient_data_scaled)[0]
    probability = best_tuned_model.predict_proba(patient_data_scaled)[0][1]  # Probability of being diabetic
    
    return prediction, probability

# Example usage
sample_patient = {
    'pregnancies': 6,
    'glucose': 148,
    'blood_pressure': 72,
    'skin_thickness': 35,
    'insulin': 0,
    'bmi': 33.6,
    'diabetes_pedigree': 0.627,
    'age': 50
}

prediction, probability = predict_diabetes(**sample_patient)

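print("Sample Patient Data:")
for key, value in sample_patient.items():
    # Turn keys such as 'blood_pressure' into readable labels such as 'Blood pressure'
    print(f"{key.replace('_', ' ').capitalize()}: {value}")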

print(f"\nPrediction: {'Diabetic' if prediction == 1 else 'Non-Diabetic'}")
print(f"Probability of Diabetes: {probability:.2%}")

# Visualize the prediction probability
plt.figure(figsize=(10, 6))
probabilities = [1-probability, probability]
classes = ['Non-Diabetic', 'Diabetic']
colors = ['skyblue', 'coral']
plt.bar(classes, probabilities, color=colors)
plt.title('Prediction Probability')
plt.ylabel('Probability')
plt.ylim(0, 1)
for i, prob in enumerate(probabilities):
    plt.text(i, prob + 0.02, f'{prob:.4f}', ha='center')
plt.show()