# Diabetes Prediction Model

This is a binary classification neural network model. Its goal is to predict whether a patient is diabetic (positive) or not (negative) based on the patient's medical and demographic data in the dataset.

### About the Dataset
The Diabetes prediction dataset is a collection of medical and demographic data from patients, along with their diabetes status (positive or negative). The data includes features such as age, gender, body mass index (BMI), hypertension, heart disease, smoking history, HbA1c level, and blood glucose level.

### Attributes Description

##### gender
- Gender refers to the biological sex of the individual, which can have an impact on their susceptibility to diabetes. 
- 3 Categories: 
    - ***Male***
    - ***Female***
    - ***Other***
##### age
- Age is an important factor as diabetes is more commonly diagnosed in older adults.
- Age ranges from ***0-80*** in our dataset.
##### hypertension
- Hypertension is a medical condition in which the blood pressure in the arteries is persistently elevated. 
- Values:
    - 0 => no hypertension 
    - 1 => have hypertension
##### heart_disease
- Heart disease is another medical condition that is associated with an increased risk of developing diabetes. 
- Values:
    - 0 => no heart disease 
    - 1 => have heart disease
##### smoking_history
- Smoking history is also considered a risk factor for diabetes and can exacerbate the complications associated with diabetes.
- 5 Categories: 
    - ***Not Current***
    - ***Former***
    - ***No Info***
    - ***Current***
    - ***Never***
##### bmi
- BMI (Body Mass Index) is a measure of body fat based on weight and height. Higher BMI values are linked to a higher risk of diabetes. 
- The range of BMI in the dataset is from ***10.16 to 71.55***. BMI less than 18.5 is underweight, 18.5-24.9 is normal, 25-29.9 is overweight, and 30 or more is obese.
- Values:

| BMI range            | Category    |
|----------------------|-------------|
| BMI < 18.5           | underweight |
| 18.5 <= BMI <= 24.9  | normal      |
| 25 <= BMI <= 29.9    | overweight  |
| BMI >= 30            | obese       |
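
The BMI bands above can be expressed as a small helper. This is an illustrative sketch only: the function name `bmi_category` is hypothetical and not part of the notebook, and it uses strict `<` comparisons so that values falling between the listed bands (for example 24.95) still receive a category.

```python
def bmi_category(bmi: float) -> str:
    """Map a BMI value to one of the four categories listed above."""
    if bmi < 18.5:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    return "obese"


# A few example BMI values
for value in (17.37, 23.45, 25.19, 35.42):
    print(value, bmi_category(value))
```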

##### HbA1c_level
- HbA1c (Hemoglobin A1c) level is a measure of a person's average blood sugar level over the past 2-3 months. 
- Higher levels indicate a greater risk of developing diabetes. 
- An HbA1c level of ***6.5% or higher*** generally indicates diabetes.
##### blood_glucose_level
- Blood glucose level refers to the amount of glucose in the bloodstream at a given time. 
- High blood glucose levels are a key indicator of diabetes.
##### diabetes (***Output variable***)
- Diabetes is the target variable being predicted.
- Values:
    - 0 => no diabetes
    - 1 => have diabetes

### Importing the libraries

In [2]:
import numpy as np
import pandas as pd
import tensorflow as tf

### Part 1 - Data Preprocessing

Importing the dataset

In [3]:
dataset = pd.read_csv('diabetes_prediction_dataset.csv')

# Print the dataset
dataset

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level,diabetes
0,Female,80.0,0,1,never,25.19,6.6,140,0
1,Female,54.0,0,0,No Info,27.32,6.6,80,0
2,Male,28.0,0,0,never,27.32,5.7,158,0
3,Female,36.0,0,0,current,23.45,5.0,155,0
4,Male,76.0,1,1,current,20.14,4.8,155,0
...,...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,No Info,27.32,6.2,90,0
99996,Female,2.0,0,0,No Info,17.37,6.5,100,0
99997,Male,66.0,0,0,former,27.83,5.7,155,0
99998,Female,24.0,0,0,never,35.42,4.0,100,0


#### Splitting the Dataset

Split the dataset into two parts:
- *input_variables* - input features that are used to make the prediction
- *output_variable* - target variable to be predicted

In [4]:
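# Select all rows and every column except the last one (the target);
# the -1 end index excludes the last column.
# .copy() makes this an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
input_variables = dataset.iloc[:, 0:-1].copy()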

# get all rows
# get only the last column
output_variable = dataset.iloc[:, -1]

In [5]:
input_variables

Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
0,Female,80.0,0,1,never,25.19,6.6,140
1,Female,54.0,0,0,No Info,27.32,6.6,80
2,Male,28.0,0,0,never,27.32,5.7,158
3,Female,36.0,0,0,current,23.45,5.0,155
4,Male,76.0,1,1,current,20.14,4.8,155
...,...,...,...,...,...,...,...,...
99995,Female,80.0,0,0,No Info,27.32,6.2,90
99996,Female,2.0,0,0,No Info,17.37,6.5,100
99997,Male,66.0,0,0,former,27.83,5.7,155
99998,Female,24.0,0,0,never,35.42,4.0,100


In [6]:
output_variable = pd.DataFrame(output_variable, columns=["diabetes"])

output_variable

Unnamed: 0,diabetes
0,0
1,0
2,0
3,0
4,0
...,...
99995,0
99996,0
99997,0
99998,0
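
The same split can also be written with column names instead of positions, which keeps working if the columns are ever reordered. A minimal sketch on a tiny hand-made frame that holds only a subset of the columns; `sample`, `features`, and `target` are hypothetical names used for illustration:

```python
import pandas as pd

# Tiny stand-in for the real dataset (a few columns, three rows)
sample = pd.DataFrame(
    {
        "gender": ["Female", "Female", "Male"],
        "age": [80.0, 54.0, 28.0],
        "bmi": [25.19, 27.32, 27.32],
        "diabetes": [0, 0, 0],
    }
)

# Everything except the target column becomes the input features
features = sample.drop(columns=["diabetes"])

# Double brackets keep the target as a one-column DataFrame
target = sample[["diabetes"]]

print(features.shape, target.shape)
```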


#### Encoding Categorical Data

Since most algorithms perform mathematical operations on their inputs, categorical data such as gender and smoking history must first be converted to numerical values.



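We use *Label Encoding* for the *gender* column because it has only a few categories. `LabelEncoder` assigns integers in alphabetical order of the labels:

    0 => Female, 1 => Male, 2 => Other
    
*Label Encoding* is simple and efficient, and unlike one-hot encoding it does not add extra columns to the data.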

In [7]:
from sklearn.preprocessing import LabelEncoder

# Initialize the label encoder
label_encoder = LabelEncoder()

# Fit and transform the gender column
input_variables['gender'] = label_encoder.fit_transform(input_variables['gender'])

input_variables


Unnamed: 0,gender,age,hypertension,heart_disease,smoking_history,bmi,HbA1c_level,blood_glucose_level
0,0,80.0,0,1,never,25.19,6.6,140
1,0,54.0,0,0,No Info,27.32,6.6,80
2,1,28.0,0,0,never,27.32,5.7,158
3,0,36.0,0,0,current,23.45,5.0,155
4,1,76.0,1,1,current,20.14,4.8,155
...,...,...,...,...,...,...,...,...
99995,0,80.0,0,0,No Info,27.32,6.2,90
99996,0,2.0,0,0,No Info,17.37,6.5,100
99997,1,66.0,0,0,former,27.83,5.7,155
99998,0,24.0,0,0,never,35.42,4.0,100
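
To check which integer each category received, the fitted encoder's `classes_` attribute can be inspected. A small sketch on a toy list, assuming the three gender categories from the attribute description; `LabelEncoder` sorts the classes, so the codes follow alphabetical order:

```python
from sklearn.preprocessing import LabelEncoder

# Toy column with the three categories from the attribute description
genders = ["Female", "Male", "Other", "Male", "Female"]

encoder = LabelEncoder()
codes = encoder.fit_transform(genders)

# classes_[i] is the original label that was encoded as integer i
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```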


We use the *One Hot Encoder* for the *smoking_history* column, as it is a nominal categorical variable.

In [8]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Initialize ColumnTransformer with OneHotEncoder
column_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [4])], remainder='passthrough')

# Fit and transform the smoking_history column (index 4) and convert the result to a NumPy array
input_variables = np.array(column_transformer.fit_transform(input_variables))

input_variables

array([[  0.  ,   0.  ,   0.  , ...,  25.19,   6.6 , 140.  ],
       [  1.  ,   0.  ,   0.  , ...,  27.32,   6.6 ,  80.  ],
       [  0.  ,   0.  ,   0.  , ...,  27.32,   5.7 , 158.  ],
       ...,
       [  0.  ,   0.  ,   0.  , ...,  27.83,   5.7 , 155.  ],
       [  0.  ,   0.  ,   0.  , ...,  35.42,   4.  , 100.  ],
       [  0.  ,   1.  ,   0.  , ...,  22.43,   6.6 ,  90.  ]])

#### Splitting the Dataset: Training set and Testing set

Purpose:
- to evaluate how well the ML model is likely to perform on unseen data
- by reserving a portion of the data for testing, you can assess the model's generalization performance


In [9]:
from sklearn.model_selection import train_test_split

# Split dataset into features(input) and target(output)
input, output = input_variables, np.array(output_variable)

# Split into 70(training)/30(testing)
input_train, input_test, output_train, output_test = train_test_split(input, output, test_size=0.3, random_state=0)

input_train

array([[  1.  ,   0.  ,   0.  , ...,  27.32,   4.8 , 159.  ],
       [  1.  ,   0.  ,   0.  , ...,  16.02,   5.8 ,  90.  ],
       [  0.  ,   1.  ,   0.  , ...,  27.28,   6.6 , 159.  ],
       ...,
       [  0.  ,   0.  ,   0.  , ...,  41.23,   9.  , 145.  ],
       [  0.  ,   0.  ,   0.  , ...,  30.18,   5.8 ,  90.  ],
       [  1.  ,   0.  ,   0.  , ...,  27.32,   4.5 , 158.  ]])

In [10]:
input_test

array([[  0.  ,   0.  ,   0.  , ...,  27.32,   4.8 , 140.  ],
       [  0.  ,   0.  ,   0.  , ...,  27.32,   4.8 , 100.  ],
       [  0.  ,   1.  ,   0.  , ...,  37.16,   6.6 ,  85.  ],
       ...,
       [  0.  ,   1.  ,   0.  , ...,  27.32,   6.2 , 145.  ],
       [  1.  ,   0.  ,   0.  , ...,  25.58,   5.7 , 200.  ],
       [  0.  ,   0.  ,   0.  , ...,  21.68,   5.7 , 155.  ]])

In [11]:
output_train

array([[0],
       [0],
       [0],
       ...,
       [1],
       [0],
       [0]], dtype=int64)

In [12]:
output_test

array([[0],
       [0],
       [0],
       ...,
       [0],
       [0],
       [0]], dtype=int64)
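
If the classes turn out to be imbalanced, it can help to compare the share of positive cases in each split. The sketch below uses synthetic labels, not the real dataset, and passes `stratify`, an optional `train_test_split` argument that the cell above does not use; it is shown only as an option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 samples, 10% positive labels (made-up numbers)
features = np.arange(1000).reshape(-1, 1)
labels = np.array([1] * 100 + [0] * 900)

x_train, x_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0, stratify=labels
)

# With stratify, both splits keep roughly the same positive rate as the full data
print(x_train.shape, x_test.shape)
print(y_train.mean(), y_test.mean())
```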

#### Feature Scaling

Feature scaling is a preprocessing technique in machine learning that transforms the numerical features of a dataset into a specific range or distribution.

In [13]:
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()

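# Fit the scaler on the training set only, then apply the same
# mean and standard deviation to the test set to avoid data leakage
input_train = standard_scaler.fit_transform(input_train)
input_test = standard_scaler.transform(input_test)

input_train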

array([[ 1.33776777, -0.32091669, -0.20518498, ..., -0.00383155,
        -0.67622209,  0.51531068],
       [ 1.33776777, -0.32091669, -0.20518498, ..., -1.69896895,
         0.25584604, -1.17972675],
       [-0.7475139 ,  3.11607354, -0.20518498, ..., -0.00983203,
         1.00150055,  0.51531068],
       ...,
       [-0.7475139 , -0.32091669, -0.20518498, ...,  2.0828376 ,
         3.23846406,  0.17139004],
       [-0.7475139 , -0.32091669, -0.20518498, ...,  0.42520323,
         0.25584604, -1.17972675],
       [ 1.33776777, -0.32091669, -0.20518498, ..., -0.00383155,
        -0.95584252,  0.49074492]])
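
`StandardScaler` standardizes each column as z = (x - mean) / std, using the mean and standard deviation learned during fitting. The NumPy-only sketch below reproduces that formula on made-up values and follows the usual practice of transforming the test set with the training statistics:

```python
import numpy as np

# Made-up single-feature data (not from the dataset)
train = np.array([[20.0], [25.0], [30.0], [35.0]])
test = np.array([[27.5], [40.0]])

# Statistics are computed from the training set only
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation, as StandardScaler uses

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training mean and std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled.ravel())
```

The scaled training column has zero mean and unit variance by construction; the test column generally does not, because it is measured against the training distribution.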

#### Building the Artificial Neural Network (ANN)

In this step, we will begin developing the model.

##### General Outline of the Process

- **Run Individual Experiments**

1. Experiment with different numbers of hidden layers and units per layer
2. Experiment with different activation functions in the hidden layers
3. Experiment with different optimizers


- **Analysis of Individual Experiments**

    Analyze how each factor affects the model's performance. For each experiment, pick the best-performing configuration.

- **Combining Factors for the Final Architecture of the Model**

    Combine the best-performing configuration from each category (architecture, activation function, and optimizer).

- **Training the Final Model**

    Train the final neural network model using the training data.

- **Evaluation and Validation**

    Evaluate the final model's performance on a separate validation dataset to ensure it generalizes well to unseen data.
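
The one-factor-at-a-time experiments in the outline could be organized as a list of configurations derived from a baseline. This is only a sketch of the bookkeeping: the baseline values and candidate options below are placeholders chosen for illustration, not the settings used in this project, and `one_factor_experiments` is a hypothetical helper.

```python
from typing import Any

# Placeholder baseline and candidates (assumptions, not the project's settings)
BASELINE: dict[str, Any] = {
    "hidden_units": (6, 6),
    "activation": "relu",
    "optimizer": "adam",
}
CANDIDATES: dict[str, list[Any]] = {
    "hidden_units": [(6,), (6, 6), (12, 6)],
    "activation": ["relu", "tanh", "sigmoid"],
    "optimizer": ["adam", "sgd", "rmsprop"],
}


def one_factor_experiments(
    baseline: dict[str, Any], candidates: dict[str, list[Any]]
) -> list[dict[str, Any]]:
    """Return configs that change exactly one factor of the baseline at a time."""
    configs = []
    for factor, options in candidates.items():
        for option in options:
            if option == baseline[factor]:
                continue  # skip the value the baseline already covers
            configs.append({**baseline, factor: option})
    return configs


experiments = [BASELINE] + one_factor_experiments(BASELINE, CANDIDATES)
for config in experiments:
    print(config)
```

Each configuration would then be passed to a model-building and training routine, and the best value found for each factor would be combined into the final architecture.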