# Diabetes Prediction Analysis

This notebook presents an analysis of a diabetes prediction dataset. The dataset includes medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level.

The goal of this analysis is to explore the dataset, look for patterns, outliers, and correlations, and report the findings. We also build machine learning models to predict whether a patient has diabetes from these features: several scikit-learn classifiers and a neural network built with TensorFlow. The performance of these models is then compared and visualized.

Dataset: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset

In [None]:
!pip install tensorflow

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import tensorflow as tf
from tensorflow import keras

# Load the dataset
df = pd.read_csv('diabetes_prediction_dataset.csv')
df.head()

In [None]:
# Check for missing values
df.isnull().sum()

In [None]:
# Exploratory Data Analysis
df.describe()
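
The summary statistics do not show how the two diabetes classes are distributed, and they do not flag outliers directly. The following is a minimal sketch of both checks, using a small made-up frame in place of `df`. The values and the helper name `iqr_outlier_mask` are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Tiny stand-in for the real dataset; the values are made up for illustration.
toy = pd.DataFrame({
    "bmi": [22.0, 24.5, 27.3, 25.1, 23.8, 26.0, 95.0, 24.9],
    "diabetes": [0, 0, 0, 1, 0, 0, 1, 0],
})


def iqr_outlier_mask(series, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Share of each class; a strong imbalance makes plain accuracy harder to interpret.
class_share = toy["diabetes"].value_counts(normalize=True)
print(class_share)

# Rows whose BMI falls outside the IQR fences.
outliers = toy[iqr_outlier_mask(toy["bmi"])]
print(outliers)
```

On the real data the same helper could be applied to columns such as `bmi` or `blood_glucose_level`, assuming those are the column names in the CSV.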

In [None]:
# Visualize the data
plt.figure(figsize=(10, 8))
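# gender and smoking_history are still strings at this point, so restrict to numeric columns
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')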
plt.title('Correlation Matrix of Features')

In [None]:
from sklearn.preprocessing import LabelEncoder

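# Encode categorical variables (one encoder per column so each mapping stays recoverable)
gender_le = LabelEncoder()
smoking_le = LabelEncoder()
df['gender'] = gender_le.fit_transform(df['gender'])
df['smoking_history'] = smoking_le.fit_transform(df['smoking_history'])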

# Split the data into training and testing sets
X = df.drop('diabetes', axis=1)
y = df['diabetes']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the data
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

In [None]:
# Initialize the models
log_reg = LogisticRegression()
knn = KNeighborsClassifier()
svm = SVC()
dt = DecisionTreeClassifier()
rf = RandomForestClassifier()

# Train and evaluate the models
models = [log_reg, knn, svm, dt, rf]
model_names = ['Logistic Regression', 'K-Nearest Neighbors', 'Support Vector Machine', 'Decision Tree', 'Random Forest']
accuracy_scores = []

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)

# Display the accuracy scores
for name, score in zip(model_names, accuracy_scores):
    print(f'{name} Accuracy: {score * 100:.2f}%')
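
A single train/test split gives one accuracy figure per model, and that figure can shift with the split. The sketch below runs a stratified 5-fold cross-validation on synthetic data from `make_classification`, which stands in for the real features. The data, the class weights, and the two models compared are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary problem standing in for the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=42
)

# Stratified folds keep the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    # Scaling lives inside the pipeline so it is refitted on each training fold.
    "Logistic Regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
}

cv_results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the mean together with the spread across folds makes small accuracy gaps between models easier to judge.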

In [None]:
# Initialize the neural network model
model = keras.Sequential([
    keras.layers.Dense(32, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dense(16, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f'Neural Network Accuracy: {accuracy * 100:.2f}%')

In [None]:
# Add the accuracy of the neural network model to the list
model_names.append('Neural Network')
accuracy_scores.append(accuracy)

# Create a DataFrame to store the accuracy scores
df_scores = pd.DataFrame({'Model': model_names, 'Accuracy': accuracy_scores})

# Create a bar chart to compare the accuracy of the models
plt.figure(figsize=(10, 6))
sns.barplot(x='Accuracy', y='Model', data=df_scores, palette='Blues_d')
plt.title('Model Comparison - Accuracy')
plt.xlabel('Accuracy')
plt.ylabel('Model')

## Comparison Analysis for Top 3 Models

The top 3 models based on accuracy are Random Forest, Neural Network, and Support Vector Machine. Let's perform a more detailed comparison of these models. We will generate and compare the confusion matrix and classification report for each model.

In [None]:
# Generate and print the confusion matrix and classification report for each model
top_models = [rf, svm, model]
top_model_names = ['Random Forest', 'Support Vector Machine', 'Neural Network']

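for i in range(len(top_models)):
    if top_model_names[i] == 'Neural Network':
        # Keras returns sigmoid probabilities of shape (n, 1): threshold at 0.5 and flatten
        y_pred = (top_models[i].predict(X_test) > 0.5).astype(int).ravel()
    else:
        y_pred = top_models[i].predict(X_test)

    print(f'{top_model_names[i]}:')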
    print('Confusion Matrix:')
    print(confusion_matrix(y_test, y_pred))
    print('Classification Report:')
    print(classification_report(y_test, y_pred))
    print('-' * 60)
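
The classification report derives its per-class figures from the confusion matrix. The worked sketch below uses hand-picked labels (assumed values, not model output) to show how precision and recall for the positive class follow from the four cells.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-picked labels for illustration: 6 negatives and 4 positives.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, scikit-learn lays the matrix out as [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many are correct
recall = tp / (tp + fn)     # of the actual positives, how many are found

print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here 7 of 10 predictions are correct, yet only half of the positive cases are found, which is why the report is worth reading alongside accuracy.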


Let's visualize the confusion matrices for the top 3 models to better understand their performance.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

# Plot confusion matrices for the top models
fig, axs = plt.subplots(1, 3, figsize=(20, 5))

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
for i, clf in enumerate(top_models[:-1]):  # scikit-learn models only
    ConfusionMatrixDisplay.from_estimator(clf, X_test, y_test, ax=axs[i], cmap='Blues')
    axs[i].set_title(top_model_names[i])

# The Keras model is not a scikit-learn estimator, so plot its matrix from thresholded predictions
y_pred_nn = (top_models[-1].predict(X_test) > 0.5).astype(int).ravel()
cm_nn = confusion_matrix(y_test, y_pred_nn)
sns.heatmap(cm_nn, annot=True, fmt='d', cmap='Blues', ax=axs[-1])
axs[-1].set_title(top_model_names[-1])
axs[-1].set_xlabel('Predicted label')
axs[-1].set_ylabel('True label')

# Conclusion

This analysis involved the examination of a diabetes prediction dataset and the creation of machine learning models to predict the likelihood of diabetes based on various medical and demographic factors.

The exploratory analysis covered missing values, summary statistics, and feature correlations. The data was then prepared for machine learning by encoding the categorical variables, splitting it into training and testing sets (80/20), and standardizing the features so that they share a comparable scale.

We trained several machine learning models: Logistic Regression, K-Nearest Neighbors, Support Vector Machine, Decision Tree, and Random Forest. We also created a neural network model using TensorFlow. The performance of these models was evaluated and compared. The Random Forest model performed best with an accuracy of 97.00%, followed closely by the Neural Network model at 96.86%. These figures come from a single run; because the Random Forest and the neural network are trained without fixed random seeds, the exact values will vary slightly between runs.

In conclusion, the models we've trained could be used to predict whether a patient is likely to have diabetes based on their medical history and demographic information. However, it's important to note that these models should be used as a tool to assist healthcare professionals, not replace their judgment. The choice of model may depend on the specific needs of the application. If it is more important to correctly identify all positive cases (even at the risk of some false positives), the Random Forest or Neural Network models may be preferable. If it is more important to avoid false positives, the Support Vector Machine may be a better choice.
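
The trade-off between catching positive cases and avoiding false positives can also be adjusted within a single model, by moving the decision threshold applied to its predicted probabilities. The sketch below is a minimal illustration on synthetic data. The data set, the Random Forest settings, and the two thresholds are assumptions, not results from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced binary problem standing in for the diabetes data.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # estimated probability of the positive class

threshold_metrics = {}
for threshold in (0.5, 0.2):
    y_hat = (proba >= threshold).astype(int)
    recall = recall_score(y_te, y_hat)
    precision = precision_score(y_te, y_hat, zero_division=0)
    threshold_metrics[threshold] = (recall, precision)
    print(f"threshold={threshold}: recall={recall:.3f} precision={precision:.3f}")
```

Lowering the threshold can only keep or raise recall, since every case flagged at 0.5 is still flagged at 0.2; the cost is usually more false positives and therefore lower precision.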