# 🩺 Diabetes Classification Using TensorFlow

- This is a code example to create a classification model using TensorFlow to determine if a person has diabetes or not.

- The notebook explains each step and the reason for it.

- The model will be trained based on a dataset that contains information about different characteristics of people.

# 📚 Imports

In [7]:
import pandas as pd
import numpy as np

from autoviz.classify_method import data_cleaning_suggestions
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Dropout

- **pandas** and **numpy** are popular libraries for data manipulation and analysis


- **autoviz.classify_method** is used to provide suggestions for data cleaning, such as handling missing values and encoding labels


- **sklearn.preprocessing.LabelEncoder** is used to encode categorical labels into numeric values


- **sklearn.preprocessing.StandardScaler** is used to standardize the dataset's features


- **sklearn.model_selection.train_test_split** is used to split the dataset into training and test sets


- **tensorflow** is a popular framework for building and training machine learning models


- **tensorflow.keras.Sequential** is a sequential model where the layers are linearly stacked


- **tensorflow.keras.layers.Dense** defines a dense (fully connected) layer of the neural network


- **tensorflow.keras.layers.Dropout** is used to add dropout layers to avoid overfitting

---

# 📁 Loading the dataset

In [8]:
# Loading the dataset
df = pd.read_csv('data/diabetes_prediction_dataset.csv')

# Showing 5 rows from the dataset
df.head()

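|   | gender | age | hypertension | heart_disease | smoking_history | bmi | HbA1c_level | blood_glucose_level | diabetes |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 80.0 | 0 | 1 | never | 25.19 | 6.6 | 140 | 0 |
| 1 | Female | 54.0 | 0 | 0 | No Info | 27.32 | 6.6 | 80 | 0 |
| 2 | Male | 28.0 | 0 | 0 | never | 27.32 | 5.7 | 158 | 0 |
| 3 | Female | 36.0 | 0 | 0 | current | 23.45 | 5.0 | 155 | 0 |
| 4 | Male | 76.0 | 1 | 1 | current | 20.14 | 4.8 | 155 | 0 |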


- The diabetes dataset is read from the 'data/diabetes_prediction_dataset.csv' file with pd.read_csv, and df.head() displays its first five rows.

---

# 🧮 Descriptive Statistics

In [9]:
df.describe().T

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| age | 100000.0 | 41.885856 | 22.51684 | 0.08 | 24.0 | 43.0 | 60.0 | 80.0 |
| hypertension | 100000.0 | 0.07485 | 0.26315 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| heart_disease | 100000.0 | 0.03942 | 0.194593 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| bmi | 100000.0 | 27.320767 | 6.636783 | 10.01 | 23.63 | 27.32 | 29.58 | 95.69 |
| HbA1c_level | 100000.0 | 5.527507 | 1.070672 | 3.5 | 4.8 | 5.8 | 6.2 | 9.0 |
| blood_glucose_level | 100000.0 | 138.05806 | 40.708136 | 80.0 | 100.0 | 140.0 | 159.0 | 300.0 |
| diabetes | 100000.0 | 0.085 | 0.278883 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |


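- **df.describe().T** computes descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for each numeric column and transposes the result so that each row describes one column. The text columns 'gender' and 'smoking_history' do not appear in the output. The mean of the 'diabetes' column, 0.085, shows that only 8.5% of the 100,000 rows are positive cases.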

---

# 🧹 Data cleaning suggestions

In [10]:
# Data cleaning suggestions
data_cleaning_suggestions(df)

Data cleaning improvement suggestions. Complete them before proceeding to ML modeling.


|   | Nuniques | dtype | Nulls | Nullpercent | NuniquePercent | Value counts Min | Data cleaning improvement suggestions |
|---|---|---|---|---|---|---|---|
| bmi | 4247 | float64 | 0 | 0.0 | 4.247 | 0 | |
| age | 102 | float64 | 0 | 0.0 | 0.102 | 0 | |
| HbA1c_level | 18 | float64 | 0 | 0.0 | 0.018 | 0 | |
| blood_glucose_level | 18 | int64 | 0 | 0.0 | 0.018 | 0 | |
| smoking_history | 6 | object | 0 | 0.0 | 0.006 | 4004 | |
| gender | 3 | object | 0 | 0.0 | 0.003 | 18 | |
| hypertension | 2 | int64 | 0 | 0.0 | 0.002 | 0 | |
| heart_disease | 2 | int64 | 0 | 0.0 | 0.002 | 0 | |
| diabetes | 2 | int64 | 0 | 0.0 | 0.002 | 0 | |


- **data_cleaning_suggestions(df)** reports, for each column, the number of unique values, the dtype, the count and percentage of null values, and any suggested cleaning step. In this dataset no column has null values and the suggestions column is empty, so the function flags nothing to fix before modeling.

---

# 🧬 Convert categorical columns to numeric

In [11]:
le = LabelEncoder()

list_str = ['gender', 'smoking_history']
for c in list_str:
    df[c] = le.fit_transform(df[c])

- Here we use LabelEncoder, a class in the sklearn.preprocessing module, to replace each distinct value in the categorical columns 'gender' and 'smoking_history' with an integer. The conversion is needed because neural networks, like many machine learning algorithms, only accept numerical input. The loop calls fit_transform on each column in turn, so the encoder is refitted for every column.
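
To make the encoding concrete, here is a minimal pure-Python sketch of the idea: each distinct value is mapped to its position in the sorted list of distinct values. The helper `label_encode` is a hypothetical name introduced for illustration and is not scikit-learn's implementation. The sample values are the 'smoking_history' entries from the five rows shown by df.head().

```python
# Pure-Python sketch of integer label encoding (illustrative only,
# not scikit-learn's implementation).
def label_encode(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# 'smoking_history' values from the five rows shown by df.head()
smoking_history = ['never', 'No Info', 'never', 'current', 'current']
encoded, classes = label_encode(smoking_history)

print(classes)   # distinct values in sorted order
print(encoded)   # integer code for each row
```

Note that the integer codes imply an ordering between categories that has no real meaning. For a small network like this one that is often acceptable, but it is worth keeping in mind when interpreting the model.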

---

# ✂️ Splitting the data into training and testing sets

In [12]:
X = df.drop('diabetes', axis = 1)
y = df['diabetes']

xtrain, xtest, ytrain, ytest = train_test_split(X, y, test_size = 0.2, random_state = 0)

- Here, we split the data into training and testing sets using the train_test_split function of the sklearn.model_selection module. With test_size = 0.2, 80% of the rows go to training and 20% to testing. The rows are shuffled before the split, and fixing random_state makes the shuffle, and therefore the split, reproducible.

- Set X contains all the columns of the dataframe, except the 'diabetes' column, which is the target variable we want to predict. Set y contains only the 'diabetes' column. The sets xtrain and ytrain are used to train the model, while xtest and ytest are used to evaluate the performance of the model.
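
The effect of a seeded random split can be sketched in plain Python. The helper `holdout_split` below is hypothetical and does not reproduce the exact row assignment that train_test_split makes. It only illustrates the sizes of the two sets for 100,000 rows and why a fixed seed gives the same split every time.

```python
import random

def holdout_split(n_rows, test_fraction, seed):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    # Return (train indices, test indices)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = holdout_split(n_rows=100_000, test_fraction=0.2, seed=0)
print(len(train_idx), len(test_idx))
```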

---

# 📏 Standardizing the data

In [13]:
scaler = StandardScaler()
xtrain = scaler.fit_transform(xtrain)
xtest = scaler.transform(xtest)

- Here, we use the StandardScaler from the sklearn.preprocessing module to standardize the numerical data. Standardization is a common step when preparing data for machine learning models. It transforms each feature so that its mean is 0 and its standard deviation is 1, which puts all features on the same scale. This matters because many machine learning algorithms, including neural networks trained by gradient descent, are sensitive to the scale of the data.

- First, we create an instance of StandardScaler called scaler. The fit_transform method computes the mean and standard deviation of each feature from the xtrain training set and standardizes xtrain in the same call. The transform method then applies those same training-set statistics to xtest. The test set is never used to compute the statistics, so no information from it leaks into training.
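
The arithmetic behind this step can be sketched for a single feature using only the standard library. The values are the 'bmi' entries from the five rows shown by df.head(). Treating the first four as training data and the last as test data is an assumption made purely for illustration.

```python
from statistics import fmean, pstdev

# 'bmi' values from df.head(); the train/test assignment is hypothetical
train_bmi = [25.19, 27.32, 27.32, 23.45]
test_bmi = [20.14]

# Statistics are computed on the training values only
mean = fmean(train_bmi)
std = pstdev(train_bmi)  # population standard deviation, as StandardScaler uses

# The same mean and std are applied to both sets
train_scaled = [(x - mean) / std for x in train_bmi]
test_scaled = [(x - mean) / std for x in test_bmi]

print(train_scaled)
print(test_scaled)
```

After scaling, the training values have mean 0 and standard deviation 1. The test value is expressed in the same units, as a number of training standard deviations away from the training mean.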

---

# ⚙️ Create the model

In [14]:
model = Sequential([
    Dense(32, activation = 'relu', input_shape = (xtrain.shape[1],)),
    Dropout(0.1),
    Dense(32, activation = 'relu'),
    Dropout(0.5),
    Dense(1, activation = 'sigmoid')
])

In this piece of code, we are creating a neural network model using TensorFlow. The model is defined as a sequence of stacked layers. Here is an explanation of each part:

- **Dense(32, activation='relu', input_shape=(xtrain.shape[1],)):** This line creates a dense layer with 32 units (neurons) and the ReLU activation function. The layer receives as input a tensor of shape (xtrain.shape[1],), that is, one value per feature column of the training set. Because this is the first layer of the model, we must specify the input shape.


- **Dropout(0.1):** This line adds a dropout layer with a rate of 0.1. Dropout is a regularization technique that helps prevent overfitting by randomly deactivating a fraction of neurons during training.


- **Dense(32, activation='relu'):** This line creates another dense layer with 32 units and the ReLU activation function. This is the second dense layer of the model. Its input shape does not need to be specified because it is inferred from the output of the previous layer.


- **Dropout(0.5):** This line adds a second dropout layer with a rate of 0.5.


- **Dense(1, activation='sigmoid'):** This line creates the output layer of the model with a single neuron and the sigmoid activation function. The sigmoid squashes the output into a value between 0 and 1, which is read as the predicted probability of the positive class (diabetes). Comparing that probability with a cutoff yields the binary class, 0 or 1.
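
The step from the sigmoid output to a class label can be sketched as follows. The input values are hypothetical pre-activation values of the output neuron, and the 0.5 cutoff is an assumption (it is the usual default, but the notebook does not set it explicitly).

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical pre-activation values of the output neuron for three people
logits = [-3.0, 0.0, 2.0]
probabilities = [sigmoid(z) for z in logits]

# Assumed cutoff: predict class 1 only when the probability exceeds 0.5
threshold = 0.5
predicted = [int(p > threshold) for p in probabilities]

print(probabilities)
print(predicted)
```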

---

# 🧾 Model compilation and Summary

In [15]:
model.compile(loss = 'binary_crossentropy', optimizer = 'adam', metrics = ['accuracy'])
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 32)                288       
                                                                 
 dropout (Dropout)           (None, 32)                0         
                                                                 
 dense_1 (Dense)             (None, 32)                1056      
                                                                 
 dropout_1 (Dropout)         (None, 32)                0         
                                                                 
 dense_2 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,377
Trainable params: 1,377
Non-trainable params: 0
_________________________________________________________________


After creating the model, we need to compile it before training it. On the first line, we set the compilation options:

- **loss='binary_crossentropy':** We use the binary cross entropy as the loss function. This loss function is suitable for binary classification problems, where we are trying to predict one of two classes.


- **optimizer='adam':** The Adam optimizer will be used to adjust the model weights during training. Adam is a popular variant of stochastic gradient descent that adapts the step size for each weight.


- **metrics=['accuracy']:** In addition to the loss function, we also want to track the accuracy metric during model training and evaluation. Accuracy is a common measure for evaluating classification model performance.

On the second line, we print a model summary, which displays the architecture of the neural network in tabular form. The summary lists each layer with its output shape and parameter count, followed by the total, trainable, and non-trainable parameter counts.
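
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit, and dropout layers have no parameters. The sketch below assumes 8 input features, which is the number of columns left in X after dropping 'diabetes'. The helper name `dense_params` is introduced here for illustration.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

n_features = 8            # columns in X after dropping 'diabetes'
layer_sizes = [32, 32, 1] # units of the three Dense layers

counts = []
n_in = n_features
for units in layer_sizes:
    counts.append(dense_params(n_in, units))
    n_in = units  # the next layer's input is this layer's output

print(counts)       # per-layer parameter counts
print(sum(counts))  # total parameters
```

The result matches the 288, 1056, and 33 parameters reported for dense, dense_1, and dense_2, and the total of 1,377.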

---

# 🏋️ Training the model

In [16]:
model.fit(xtrain, ytrain, epochs = 20, batch_size = 16, validation_data = (xtest, ytest))

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.callbacks.History at 0x2101cae17f0>

In this part of the code, we are training the neural network model. Here is an explanation of the different parts:

- **xtrain** and **ytrain** are the training data, where xtrain contains the features (inputs) and ytrain contains the corresponding labels (outputs). This data is used to adjust the model weights during training.


- **epochs** is the number of times the model will go through the entire training set. Each epoch consists of a cycle of going through the training data and adjusting the model weights.


- **batch_size** is the number of training examples used in a single iteration. The training set is divided into smaller batches and adjustment of model weights is performed after each batch.


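- **validation_data = (xtest, ytest)** specifies the validation data. At the end of each epoch, the model's loss and accuracy are computed on this data, which is not used to update the weights. xtest contains the test features and ytest the corresponding labels. Because the test set doubles as the validation set here, the validation metrics and the final test metrics come from the same data.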
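
To see how epochs and batch_size interact, the number of weight updates can be worked out directly. The sketch assumes 80,000 training rows, that is, 80% of the 100,000 rows reported by df.describe().

```python
import math

n_train = 80_000   # assumed: 80% of 100,000 rows
batch_size = 16
epochs = 20

# One weight update per batch; the last batch may be smaller than batch_size
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch)
print(total_updates)
```

Under these assumptions each epoch performs 5,000 weight updates, for 100,000 updates over the whole training run.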

---

# 📋 Model Results

In [17]:
loss, accuracy = model.evaluate(xtest, ytest)
print(f'Test loss: {loss:.2f}')
print(f'Test accuracy: {accuracy:.2f}')

Test loss: 0.08
Test accuracy: 0.97


In this part of the code, we are evaluating the performance of the trained model using the test data. Here is an explanation of the different parts:

- **model.evaluate(xtest, ytest)** calculates the loss and accuracy of the model on the test data. The loss is the binary cross entropy between the predicted probabilities and the true labels, so lower values are better. The accuracy is the proportion of test examples that the model classifies correctly.


- **loss** is the loss calculated by the model on the test data.


- **accuracy** is the accuracy calculated by the model on the test data.


- **print(f'Test loss: {loss:.2f}')** prints the loss calculated on the test data, rounded to two decimal places.


- **print(f'Test accuracy: {accuracy:.2f}')** prints the accuracy calculated on the test data, rounded to two decimal places.

This information is useful for understanding the performance of the trained model and evaluating its ability to generalize to previously unseen data.
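
As a rough illustration of what the two reported numbers measure, the sketch below computes binary cross entropy and accuracy by hand. The labels and probabilities are hypothetical, the helper names are introduced here, and the 0.5 cutoff is an assumption. The last line uses the 8.5% positive rate from the descriptive statistics to estimate the accuracy of always predicting "no diabetes" on the full dataset. The rate in the test split may differ slightly.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Share of examples whose thresholded prediction equals the label."""
    hits = sum(int(p > threshold) == y for y, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Hypothetical labels and predicted probabilities for five people
y_true = [0, 0, 1, 0, 1]
y_prob = [0.1, 0.2, 0.9, 0.4, 0.3]

example_loss = binary_cross_entropy(y_true, y_prob)
example_accuracy = binary_accuracy(y_true, y_prob)

# Accuracy of always predicting class 0 when 8.5% of rows are positive
majority_baseline = 1 - 0.085

print(f'{example_loss:.2f}', f'{example_accuracy:.2f}', f'{majority_baseline:.3f}')
```

Since a model that never predicts diabetes would already be about 91.5% accurate on this data, the reported accuracy is best read against that baseline rather than against 50%.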

---

# 😁 Thank you! Feel free to criticize! 👋🏼

---