<a href="https://colab.research.google.com/github/AshorYaghob/Diabetes_Detection_Neural_Network/blob/main/diabetes_detection_NN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Load In Dataset
We load the dataset and inspect it before training. This includes:
1. Checking for nulls
2. Filling in nulls
3. Checking the data types
4. Encoding categorical features

Dataset: https://www.kaggle.com/datasets/abdelazizsami/early-stage-diabetes-risk-prediction
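
The four steps above can be sketched on a small, made-up frame. This is only an illustration: the `toy` DataFrame and its values are hypothetical, the median/mode fill strategy is an assumption (the real dataset turns out to have no nulls), and `pd.get_dummies` is used here as a compact stand-in for the `OneHotEncoder` loop in the cells below.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV
toy = pd.DataFrame({
    "Age": [40, np.nan, 41, 45],
    "Gender": ["Male", "Female", None, "Male"],
    "Polyuria": ["No", "Yes", "?", "Yes"],
})

# 1. Treat placeholder strings as missing, then count nulls per column
toy = toy.replace("?", np.nan)
null_counts = toy.isnull().sum()

# 2. Fill nulls: median for numeric columns, most frequent value otherwise
for col in toy.columns:
    if pd.api.types.is_numeric_dtype(toy[col]):
        toy[col] = toy[col].fillna(toy[col].median())
    else:
        toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

# 3. Check the data types
print(toy.dtypes)

# 4. One-hot encode the categorical columns, dropping the first category of each
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)
print(encoded)
```

Whether median and mode are sensible fill values depends on the data; they are shown only because they are common defaults.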

In [29]:
import pandas as pd
import numpy as np

# Load the dataset
df = pd.read_csv('diabetes_data_upload.csv', sep=',')
df.replace('?', np.nan, inplace=True)
df.replace(' ', np.nan, inplace=True)
print(df.isnull().sum())

Age                   0
Gender                0
Polyuria              0
Polydipsia            0
sudden weight loss    0
weakness              0
Polyphagia            0
Genital thrush        0
visual blurring       0
Itching               0
Irritability          0
delayed healing       0
partial paresis       0
muscle stiffness      0
Alopecia              0
Obesity               0
class                 0
dtype: int64


In [30]:
df.dtypes

Age                    int64
Gender                object
Polyuria              object
Polydipsia            object
sudden weight loss    object
weakness              object
Polyphagia            object
Genital thrush        object
visual blurring       object
Itching               object


In [31]:
# No apparent irrelevant attributes
# Keeping the code in case someone wants to experiment with it
irrelevant_attributes = []
df.drop(irrelevant_attributes, axis=1, inplace=True)
print(df.isnull().sum())

Age                   0
Gender                0
Polyuria              0
Polydipsia            0
sudden weight loss    0
weakness              0
Polyphagia            0
Genital thrush        0
visual blurring       0
Itching               0
Irritability          0
delayed healing       0
partial paresis       0
muscle stiffness      0
Alopecia              0
Obesity               0
class                 0
dtype: int64


In [32]:
categorical_features = df.select_dtypes(include=['object']).columns.tolist()
print(categorical_features)
numerical_features = ['Age']
print(numerical_features)

['Gender', 'Polyuria', 'Polydipsia', 'sudden weight loss', 'weakness', 'Polyphagia', 'Genital thrush', 'visual blurring', 'Itching', 'Irritability', 'delayed healing', 'partial paresis', 'muscle stiffness', 'Alopecia', 'Obesity', 'class']
['Age']


In [33]:
from sklearn.preprocessing import StandardScaler, OneHotEncoder

df_encoded = df.copy()
# Make sure the encoder returns a dense NumPy array rather than a sparse matrix; also drop the first category to avoid redundancy
encoder = OneHotEncoder(sparse_output=False, drop='first')

for col in categorical_features:
    # Encode the entire column. transformed_data holds the new column(s) containing the encoded data
    transformed_data = encoder.fit_transform(df_encoded[[col]])
    # encoder.get_feature_names_out([col]) generates the column names for the encoded data,
    # which are used to convert transformed_data into a DataFrame
    encoded_df = pd.DataFrame(transformed_data, columns=encoder.get_feature_names_out([col]))
    # Drop the original column from the df, then concatenate the new DataFrame; axis=1 ensures column-wise concatenation
    df_encoded = pd.concat([df_encoded.drop(columns=[col]), encoded_df], axis=1)

df_encoded.head()

Unnamed: 0,Age,Gender_Male,Polyuria_Yes,Polydipsia_Yes,sudden weight loss_Yes,weakness_Yes,Polyphagia_Yes,Genital thrush_Yes,visual blurring_Yes,Itching_Yes,Irritability_Yes,delayed healing_Yes,partial paresis_Yes,muscle stiffness_Yes,Alopecia_Yes,Obesity_Yes,class_Positive
0,40,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1,58,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0
2,41,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,1.0
3,45,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0
4,60,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [34]:
# Double check the data types to ensure they are all numerical
df_encoded.dtypes

Age                         int64
Gender_Male               float64
Polyuria_Yes              float64
Polydipsia_Yes            float64
sudden weight loss_Yes    float64
weakness_Yes              float64
Polyphagia_Yes            float64
Genital thrush_Yes        float64
visual blurring_Yes       float64
Itching_Yes               float64


## Split Data

Now we split the data into training and test sets. A validation set could be added as well, but this notebook uses only training and test sets.
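
For readers who do want a validation set, one possible approach is to call `train_test_split` twice. The sketch below is not what this notebook does: the arrays `X` and `y` are hypothetical stand-ins, and the 60/20/20 ratio, `stratify`, and `random_state=42` are illustrative choices.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 rows, 3 features, 60% positive labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([1] * 60 + [0] * 40)

# First split: 60% train, 40% held out; stratify keeps the class ratio similar
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42
)

# Second split: divide the held-out part evenly into validation and test
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)

print(X_train.shape, X_val.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible between runs, which the unseeded split in the next cell is not.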

In [35]:
from sklearn.model_selection import train_test_split
# Training set has 80 percent of the data, test has 20
train_set, test_set = train_test_split(df_encoded, test_size=0.2)
train_set = train_set.reset_index(drop=True)
test_set = test_set.reset_index(drop=True)
train_set.shape, test_set.shape

((416, 17), (104, 17))

In [36]:
from typing import Tuple
import torch

# Takes a pandas DF and converts it into a tuple of Tensors, containing the feature tensor and the label Tensor.
def create_dataset(data: pd.DataFrame) -> Tuple[torch.Tensor, torch.Tensor]:
    features = torch.tensor(data.drop(columns=['class_Positive']).to_numpy(), dtype=torch.float)
    labels = torch.tensor(data['class_Positive'].to_numpy(), dtype=torch.float)
    return features, labels

In [37]:
features_train, labels_train = create_dataset(train_set)
features_test, labels_test = create_dataset(test_set)
features_train.shape, labels_train.shape, features_test.shape, labels_test.shape

(torch.Size([416, 16]),
 torch.Size([416]),
 torch.Size([104, 16]),
 torch.Size([104]))

In [38]:
# Unsqueeze the label tensors so their shape matches the model output, [N, 1].
# Instead of [1, 0, 0, 1, 0, ...] the labels become [[1], [0], [0], [1], [0], ...]
labels_train = labels_train.unsqueeze(1)
labels_test = labels_test.unsqueeze(1)
labels_train.shape, labels_test.shape

(torch.Size([416, 1]), torch.Size([104, 1]))
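
The reason the extra dimension matters can be shown with a small sketch. The tensors here are made up for illustration: `BCEWithLogitsLoss` expects the target to have the same shape as the model output, so a flat label vector is rejected when the output has shape `[N, 1]`.

```python
import torch

criterion = torch.nn.BCEWithLogitsLoss()

# Hypothetical model output for 4 instances: shape [4, 1], all logits zero
logits = torch.zeros(4, 1)
# Flat labels: shape [4]
labels = torch.tensor([1.0, 0.0, 0.0, 1.0])

# Shapes [4, 1] and [4] do not match, so the loss refuses the input
try:
    criterion(logits, labels)
    shapes_accepted = True
except (ValueError, RuntimeError):
    shapes_accepted = False

# After unsqueeze(1) the labels have shape [4, 1] and the loss can be computed
labels_2d = labels.unsqueeze(1)
loss = criterion(logits, labels_2d)
print(shapes_accepted, labels_2d.shape, loss.item())
```

With all logits at zero, every predicted probability is 0.5, so the loss equals ln 2 (about 0.693) regardless of the labels.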

## Training the Model

Now we set up the parameters to train the model.

We create the class for the neural network itself.

We then loop through training epochs until the training loss drops below 0.1 or all 200 epochs have run.
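
The loop in this notebook stops on the training loss. A common alternative is to stop when a held-out loss has not improved for several epochs. The sketch below illustrates that idea in plain Python; the helper `should_stop`, its `patience` and `min_delta` parameters, and the loss values are all hypothetical and not part of this notebook.

```python
def should_stop(losses, patience=3, min_delta=0.0):
    """Return True if none of the last `patience` losses improved on the best earlier loss."""
    if len(losses) <= patience:
        return False  # not enough history yet
    best_before = min(losses[:-patience])
    recent_best = min(losses[-patience:])
    return recent_best > best_before - min_delta


# Made-up held-out losses: improvement stalls after the third epoch
history = []
stopped_at = None
for epoch, loss in enumerate([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]):
    history.append(loss)
    if should_stop(history):
        stopped_at = epoch
        break

print(stopped_at)
```

In practice the monitored loss should come from a validation set rather than the test set, so that the test set stays untouched until the final evaluation.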

In [39]:
input_size = features_train.shape[1]
n_classes = 1
hidden_channels = input_size

In [40]:
import torch.nn.functional as F  # for the activation function
from torch import Tensor
import torch.nn as nn

class NN(torch.nn.Module):
    def __init__(self, input_size: int, n_classes: int, hidden_channels: int):
        # Initialize the parent class (torch.nn.Module)
        super().__init__()
        # Create 2 dense layers
        self.linear_layer_1 = torch.nn.Linear(input_size, hidden_channels)
        self.linear_layer_2 = torch.nn.Linear(hidden_channels, n_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # Pass every instance through the first layer, followed by a ReLU activation
        x = F.relu(self.linear_layer_1(features))
        # Use the outputs of the first layer as the input of the second layer
        x = self.linear_layer_2(x)
        # The second layer outputs one raw logit per instance; no sigmoid is applied here
        # because BCEWithLogitsLoss applies it internally.
        # For example, if the training data has 1000 instances, the output shape is [1000, 1]
        return x

In [41]:
criterion = torch.nn.BCEWithLogitsLoss()
model = NN(input_size, n_classes, hidden_channels)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
for epoch in range(200):
    # zero the gradients
    optimizer.zero_grad()
    # Calls forward method of NN and gets a tensor of predictions
    pred = model(features_train)
    # calculates the loss
    loss = criterion(pred, labels_train)
    # Calculates the gradients w.r.t. the weights and biases (the learnable parameters in this case)
    loss.backward()
    # Updates the weights and biases using the Adam optimizer
    optimizer.step()
    print(f'Epoch: {epoch},\nTrain Loss: {loss.item()}\n')
    # Evaluates the updated model on the test set, for comparison purposes
    with torch.no_grad():
        pred_test = model(features_test)
        loss_test = criterion(pred_test, labels_test)
    print(f'Test Loss: {loss_test.item()}\n')
    # In case the training loss gets small enough, we stop early
    if loss.item() < 0.1:
        break

Epoch: 0,
Train Loss: 2.7306151390075684

Test Loss: 1.90442955493927

Epoch: 1,
Train Loss: 1.9693628549575806

Test Loss: 1.3277556896209717

Epoch: 2,
Train Loss: 1.368752121925354

Test Loss: 0.9121145606040955

Epoch: 3,
Train Loss: 0.9317986369132996

Test Loss: 0.7034115195274353

Epoch: 4,
Train Loss: 0.7043991088867188

Test Loss: 0.6923841238021851

Epoch: 5,
Train Loss: 0.6789941787719727

Test Loss: 0.7765915393829346

Epoch: 6,
Train Loss: 0.7529303431510925

Test Loss: 0.8679068088531494

Epoch: 7,
Train Loss: 0.8382776975631714

Test Loss: 0.9330994486808777

Epoch: 8,
Train Loss: 0.9000421762466431

Test Loss: 0.9665456414222717

Epoch: 9,
Train Loss: 0.9320188164710999

Test Loss: 0.9712031483650208

Epoch: 10,
Train Loss: 0.9365971088409424

Test Loss: 0.9523171186447144

Epoch: 11,
Train Loss: 0.9188035726547241

Test Loss: 0.9164340496063232

Epoch: 12,
Train Loss: 0.8848500847816467

Test Loss: 0.8695341944694519

Epoch: 13,
Train Loss: 0.8405911326408386

Test Los

## Calculate ROC AUC and F1 score

We calculate the ROC AUC and F1 score to see how well the model performs on unseen data.
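
To make the two metrics concrete, here is a small pure-Python sketch that computes them from their definitions on made-up values. The helper names `f1_from_labels` and `roc_auc_from_scores` are hypothetical; the notebook itself uses the scikit-learn functions in the cells below.

```python
def f1_from_labels(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc_from_scores(y_true, scores):
    """ROC AUC is the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.4, 0.6, 0.3, 0.2]
y_pred = [int(s > 0.5) for s in scores]  # threshold at 0.5

print(f1_from_labels(y_true, y_pred))       # 2 TP, 0 FP, 1 FN
print(roc_auc_from_scores(y_true, scores))  # 5 of 6 pairs ranked correctly
```

Note that ROC AUC is computed from the probabilities and does not depend on the 0.5 threshold, while F1 is computed from the thresholded labels.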

In [42]:
# Save the predictions on the test set in a new variable
with torch.no_grad():
  test_pred = model(features_test)
  loss = criterion(test_pred, labels_test)
  print(f'Test Loss: {loss.item()}')


Test Loss: 0.22364817559719086


In [43]:
from sklearn.metrics import roc_auc_score, f1_score

# Apply the sigmoid function to the logits to get probabilities, then convert them to a NumPy array
pred_probs = torch.sigmoid(test_pred).cpu().numpy()

# Probability > 0.5 = classify as positive
pred_labels = (pred_probs > 0.5).astype(int)

# Convert the labels to numpy arrays
labels_test_np = labels_test.cpu().numpy()

# Calculate the ROC Area Under Curve score
roc_auc = roc_auc_score(labels_test_np, pred_probs)
print(f'ROC-AUC Score: {roc_auc}')

# Calculate the F1 score
f1 = f1_score(labels_test_np, pred_labels)
print(f'F1 Score: {f1}')

ROC-AUC Score: 0.9566395663956639
F1 Score: 0.9105691056910569
