**CS-634: Data Mining - Final Project**  
**Student Name:** Benyamin Plaksienko  
**Instructor Name:** Yasser Abduallah  

---

**Project Title:**  
Predicting Diabetes Using Supervised Data Mining (Classification) Binary
Classification,Long Short-Term Memory, Gaussian Naive Bayes, and Random Forest Algorithm 

---
**Note: this program requires the following prereqs**

- **Python Version**: 3.8.20
- **Conda Version**: 24.11.3
- **Python Libraries**:

In [1]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Input
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
import pandas as pd

# Diabetes Data Analysis

This project uses the **Diabetes Dataset**, retrieved from [Kaggle](https://www.kaggle.com/datasets/mathchi/diabetes-data-set).

# Classifier Performance Evaluation

## Key Terminology

- **True Positive (TP):** The number of positive examples correctly predicted by the model.
- **True Negative (TN):** The number of negative examples correctly predicted by the model.
- **False Positive (FP):** The number of negative examples wrongly predicted as positive by the model.
- **False Negative (FN):** The number of positive examples wrongly predicted as negative by the model.

Let \( P \) be the number of positive examples:  
$$
P = TP + FN
$$
Let \( N \) be the number of negative examples:  
$$
N = TN + FP
$$

## Evaluation Metrics

### True Positive Rate (TPR) or Sensitivity
The fraction of positive examples predicted correctly by the model:  
$$
TPR = \frac{TP}{TP + FN} = \frac{TP}{P}
$$

### True Negative Rate (TNR) or Specificity
The fraction of negative examples predicted correctly by the model:  
$$
TNR = \frac{TN}{TN + FP} = \frac{TN}{N}
$$

### False Positive Rate (FPR)
The fraction of negative examples predicted as positive:  
$$
FPR = \frac{FP}{TN + FP} = \frac{FP}{N}
$$

### False Negative Rate (FNR)
The fraction of positive examples predicted as negative:  
$$
FNR = \frac{FN}{TP + FN} = \frac{FN}{P}
$$

### Precision (p)
The quality of the positive prediction:  
$$
Precision = \frac{TP}{TP + FP}
$$

### Recall (r) or Sensitivity
The same as **True Positive Rate (TPR):**  
$$
Recall = \frac{TP}{TP + FN} = \frac{TP}{P}
$$

### F1 Measure
The harmonic mean of Precision and Recall:  
$$
F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}
$$

### Accuracy (Acc)
The proportion of correctly predicted labels:  
$$
Accuracy = \frac{TP + TN}{TP + FP + FN + TN} = \frac{TP + TN}{P + N}
$$

### Error Rate (Err)
The proportion of incorrect predictions:  
$$
ErrorRate = \frac{FP + FN}{TP + FP + FN + TN} = \frac{FP + FN}{P + N}
$$

---

## Additional Metrics

### Balanced Accuracy (BACC)
The average of **True Positive Rate (TPR)** and **True Negative Rate (TNR):**  
$$
Balanced\_Accuracy = \frac{TPR + TNR}{2}
$$

### True Skill Statistic (TSS)
The difference between **True Positive Rate (TPR)** and **False Positive Rate (FPR):**  
$$
TSS = TPR - FPR
$$

### Heidke Skill Score (HSS)
A measure of prediction over random prediction:  
$$
HSS = \frac{2 \times (TP \times TN - FP \times FN)}{(TP + FP) \times (FN + TN) + (TP + FN) \times (TN + FP)}
$$


In [2]:
def calculate_metrics(y_true, y_pred):
    #Confusion matrix to get TP,TN,FP, and FN
    cm = confusion_matrix(y_true, y_pred)
    #print(f'{len(cm)} x {len(cm[0])}') this was just to check shape
    
    #print(cm.shape)
    TP, FN, FP, TN = cm.flatten() if cm.shape == (2,2) else (0,0,0,0)  

    # True Positive Rate
    TPR = TP / (TP + FN) if (TP + FN) != 0 else 0  
    # True Negative Rate
    TNR = TN / (TN + FP) if (TN + FP) != 0 else 0  
    # False Positive Rate
    FPR = FP / (TN + FP) if (TN + FP) != 0 else 0  
    # False Negative Rate
    FNR = FN / (TP + FN) if (TP + FN) != 0 else 0  
    # Precision
    Precision = TP / (TP + FP) if (TP + FP) != 0 else 0 
    # F1 Measure
    F1 = (2 * Precision * TPR) / (Precision + TPR) if (Precision + TPR) != 0 else 0
    # Accuracy
    Accuracy = (TP + TN) / (TP + FP + FN + TN) if (TP + FP + FN + TN) != 0 else 0
    # Balanced Accuracy (BACC)
    Balanced_Accuracy = (TPR + TNR) / 2  
    # Error Rate
    ErrorRate = (FP + FN) / (TP + FP + FN + TN) if (TP + FP + FN + TN) != 0 else 0  
    # True Skill Statistic
    TSS = TPR - FPR  
    # Heidke Skill Score
    HSS = (2 * (TP * TN - FP * FN)) / ((TP + FP) * (FN + TN) + (TP + FN) * (TN + FP)) if (TP + FP) * (FN + TN) + (TP + FN) * (TN + FP) != 0 else 0 
    #Total
    T=TP+FN+TN+FP
    #Total Positive
    P = TP+FN
    #Total Negative
    N =TN+FP
    
    return TP, TN, FP, FN, FPR, FNR, TSS, HSS, Precision, F1, Accuracy, Balanced_Accuracy, ErrorRate, T , P , N


# **Diabetes Prediction: Machine Learning Model Comparison**

## **Objective**
This notebook compares three different machine learning models for classifying diabetes outcomes:

- **Random Forest Classifier**
- **Gaussian Naïve Bayes**
- **Long Short-Term Memory (LSTM) Neural Network**

## **Workflow**

### **1. Data Preprocessing**
- Load the dataset (`diabetes.csv`).
- Separate features (X) and the target variable (y).
- Check class distribution to assess data skewness.
- Standardize features for better model performance.
- Reshape data for LSTM (3D input format).

### **2. Model Training & Evaluation**
- Use **10-Fold Cross-Validation** for robust model evaluation.
- Train and test each model using different dataset splits.

### **3. Performance Metrics Calculation**
- The function **`calculate_metrics(y_true, y_pred)`** is used to compute:
  - True Positives (TP), True Negatives (TN), False Positives (FP), False Negatives (FN)
  - Accuracy, Precision, Recall, F1-score, False Positive Rate (FPR), False Negative Rate (FNR)
  - True Skill Statistic (TSS), Heidke Skill Score (HSS), and Error Rate.
- This function is **called three times** in the loop for each fold:
  - **For Random Forest:** `metrics_rf = calculate_metrics(y_test, y_pred_rf)`
  - **For Gaussian Naïve Bayes:** `metrics_gnb = calculate_metrics(y_test, y_pred_gnb)`
  - **For LSTM:** `metrics_lstm = calculate_metrics(y_test, y_pred_lstm)`

### **4. Results & Comparison**
- Display results for each fold in a tabular format.
- Compute **average metrics across all folds**.
- Compare models to determine the best-performing approach.

## **Conclusion**
This notebook evaluates machine learning models for diabetes prediction, providing insights into their strengths and weaknesses. The final results help in selecting the most effective model based on accuracy and other performance metrics.

# *key detail*

Since my diabetes dataset does not have timestamps or sequential dependencies, the LSTM isn't truly leveraging its full potential for capturing temporal relationships. However, it still functions as a neural network, just in a slightly unconventional way it is treated as if there is 1 time step for each sample.


In [3]:
data = pd.read_csv('diabetes.csv')

#Separate features (X) and target variable (y)
#(Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age) are the features
X = data.drop(columns=['Outcome'])

y = data['Outcome']  
#Show class distribution/ Data skewing for target outcomes (obviously the data is a bit skewed)
print(y.value_counts(normalize=True))

#Standardize the features (for better model performance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

#Reshape data for LSTM: We need a 3D array (samples, time steps, features)
X_scaled_lstm = X_scaled.reshape((X_scaled.shape[0], 1, X_scaled.shape[1]))

#model setup
rf_model = RandomForestClassifier(n_estimators=100, random_state=420)
gnb_model = GaussianNB()
def create_lstm_model(input_shape):
    model = Sequential()
    model.add(Input(shape=input_shape))  
    model.add(LSTM(64, activation='relu'))
    model.add(Dense(1, activation='sigmoid')) 
    model.compile(optimizer=Adam(), loss='binary_crossentropy', metrics=['accuracy'])
    return model
lstm_model = create_lstm_model((X_scaled_lstm.shape[1], X_scaled_lstm.shape[2]))

#Set up 10-fold cross-validation
kf = KFold(n_splits=10, shuffle=True, random_state=420)


metrics_rf_list = []
metrics_gnb_list = []
metrics_lstm_list = []
fold_dfs = []
columns = ["Fold", "Algorithm", "TP", "TN", "FP", "FN", "FPR", "FNR", "TSS", "HSS", 
           "Precision", "F1", "Accuracy", "Balanced_Accuracy", "ErrorRate", "T", "P", "N"]

#Perform KFold cross-validation and calculate metrics for each fold for all models
for fold, (train_index, test_index) in enumerate(kf.split(X_scaled), start=1):
    X_train, X_test = X_scaled[train_index], X_scaled[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    
    
    #------------------------------------------------------------------------------------
    
    #Random Forest
    rf_model.fit(X_train, y_train)
    y_pred_rf = rf_model.predict(X_test)
    metrics_rf = calculate_metrics(y_test, y_pred_rf)
    metrics_entry_rf = [fold, "RandomForest", *metrics_rf]
    metrics_rf_list.append(metrics_entry_rf)

    #------------------------------------------------------------------------------------
    
    #Gaussian Naive Bayes
    gnb_model.fit(X_train, y_train)
    y_pred_gnb = gnb_model.predict(X_test)
    metrics_gnb = calculate_metrics(y_test, y_pred_gnb)
    metrics_entry_gnb = [fold, "GaussianNB", *metrics_gnb]
    metrics_gnb_list.append(metrics_entry_gnb)
    #------------------------------------------------------------------------------------
    
    #LSTM 
    X_train_lstm, X_test_lstm = X_scaled_lstm[train_index], X_scaled_lstm[test_index]#Prepare data for LSTM
    lstm_model.fit(X_train_lstm, y_train, epochs=5, batch_size=32, verbose=0)
    y_pred_lstm = (lstm_model.predict(X_test_lstm) > 0.5).astype(int).flatten()
    metrics_lstm = calculate_metrics(y_test, y_pred_lstm)
    metrics_entry_lstm = [fold, "LSTM", *metrics_lstm]
    metrics_lstm_list.append(metrics_entry_lstm)
    #------------------------------------------------------------------------------------

    
    fold_df = pd.DataFrame([metrics_entry_rf, metrics_entry_gnb, metrics_entry_lstm], columns=columns)
    fold_dfs.append(fold_df)

#PRINTING tabular format listing all details for easier visualization for each fold and average
for fold, fold_df in enumerate(fold_dfs, start=1):
    print(f"\nMetrics for Fold {fold}:")
    print(fold_df.to_string(index=False))

df_rf = pd.DataFrame(metrics_rf_list, columns=columns)
df_gnb = pd.DataFrame(metrics_gnb_list, columns=columns)
df_lstm = pd.DataFrame(metrics_lstm_list, columns=columns)

print("\nMetrics Across All Folds for Random Forest:")
print(df_rf.to_string(index=False))
print("\nMetrics Across All Folds for Gaussian Naive Bayes:")
print(df_gnb.to_string(index=False))
print("\nMetrics Across All Folds for LSTM:")
print(df_lstm.to_string(index=False))

df_rf_no_fold = df_rf.drop(columns=['Fold'])
df_gnb_no_fold = df_gnb.drop(columns=['Fold'])
df_lstm_no_fold = df_lstm.drop(columns=['Fold'])
average_metrics_rf = df_rf_no_fold.mean(numeric_only=True)
average_metrics_gnb = df_gnb_no_fold.mean(numeric_only=True)
average_metrics_lstm = df_lstm_no_fold.mean(numeric_only=True)

print("\nAverage Metrics Across All Folds for Random Forest:")
print(average_metrics_rf)
print("\nAverage Metrics Across All Folds for Gaussian Naive Bayes:")
print(average_metrics_gnb)
print("\nAverage Metrics Across All Folds for LSTM:")
print(average_metrics_lstm)



Outcome
0    0.651042
1    0.348958
Name: proportion, dtype: float64

Metrics for Fold 1:
 Fold    Algorithm  TP  TN  FP  FN      FPR      FNR      TSS      HSS  Precision       F1  Accuracy  Balanced_Accuracy  ErrorRate  T  P  N
    1 RandomForest  43  13  17   4 0.566667 0.085106 0.348227 0.404115   0.716667 0.803738  0.727273           0.674113   0.272727 77 47 30
    1   GaussianNB  43  17  13   4 0.433333 0.085106 0.481560 0.525135   0.767857 0.834951  0.779221           0.740780   0.220779 77 47 30
    1         LSTM  45  15  15   2 0.500000 0.042553 0.457447 0.530864   0.750000 0.841121  0.779221           0.728723   0.220779 77 47 30

Metrics for Fold 2:
 Fold    Algorithm  TP  TN  FP  FN  FPR      FNR      TSS      HSS  Precision       F1  Accuracy  Balanced_Accuracy  ErrorRate  T  P  N
    2 RandomForest  45  16   9   7 0.36 0.134615 0.505385 0.516916   0.833333 0.849057  0.792208           0.752692   0.207792 77 52 25
    2   GaussianNB  44  15  10   8 0.40 0.153846 0.446154

# **Which algorithm performs better and why?**

To understand which algorithm performed better, we need to analyze the results.

## LSTM (Long Short-Term Memory)

- **Pros:**
  - LSTM leads with the highest **F1 score** (0.8384), **Accuracy** (0.780), and **Balanced Accuracy** (0.7403). This suggests that LSTM provides the best balance between precision and recall, achieving the highest overall accuracy.
  - It has the **lowest Error Rate** (0.2200), indicating that it is the most accurate model in terms of minimizing misclassifications.
  - LSTM has the **lowest False Negative Rate (FNR)** (0.114), suggesting it is better at avoiding false negatives compared to other models. For instance, Random Forest has an FNR of 0.148, and GNB (Gaussian Naive Bayes) has an FNR of 0.158.
  - It also has the **highest True Skill Statistic (TSS)** (0.4807) and **Heidke Skill Score (HSS)** (0.5047), reflecting the best overall model performance in terms of distinguishing between classes.

## Random Forest

- **Pros:**
  - Random Forest has the **lowest False Positive Rate (FPR)** (0.388), meaning it is less likely to incorrectly classify a negative instance as positive.
# **Conclusion**
Based on the analysis, **LSTM** is the best-performing model overall, with superior metrics in **Accuracy**, **Balanced Accuracy**, and **Error Rate**. It excels at minimizing misclassifications and false negatives, as well as distinguishing between classes, making it the most reliable model. **Random Forest** performs well in terms of the **False Positive Rate** but falls short in other areas like **FNR** and **Accuracy**. **GNB**, while useful in certain cases, has the highest **FNR** and lower overall performance, placing it in the third position.
# **Discussion**
In general, I was expecting Random Forest to perform better than LSTM because my data isn't sequential, so it doesn't benefit from Long Short-Term Memory in that regard. Naive Bayes doesn't handle non-linear data well, and due to the high dimensionality of this dataset, I expected it to be the worst-performing algorithm. I believe the way I organized the LSTM, with 1 time step for each sample, is partially the reason it performs better than Random Forest. It kind of works like a Dense Neural Network, and due to the amount of data I'm inputting, it can outperform Random Forest.