# Diabetes Dataset (Binary Classification)

## Initial Imports & Loading Datasets

In [None]:
# Importing packages

import numpy as np
import pandas as pd
import tensorflow as tf
import sklearn
import imblearn
import matplotlib.pyplot as plt
from scipy.stats.mstats import winsorize
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from imblearn.over_sampling import RandomOverSampler

In [None]:
# Load the dataset

dataset = pd.read_csv('datasets/diabetes.csv')
dataset.head()

## Visualizing Data

In [None]:
dataset.columns

# Histograms of each feature, split by outcome

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten() 

for i, label in enumerate(dataset.columns[:-1]):
    ax = axes[i]
    ax.hist(dataset[dataset['Outcome'] == 1][label], 
            bins=15, color='blue', label='Diabetes', 
            density=True, alpha=0.5)
    ax.hist(dataset[dataset['Outcome'] == 0][label], 
            bins=15, color='red', label='No Diabetes', 
            density=True, alpha=0.5)
    ax.set_title(label)
    ax.set_ylabel("Probability density")
    ax.set_xlabel(label)
    ax.legend()

axes[8].bar(['Diabetes', 'No Diabetes'], 
        [len(dataset[dataset['Outcome'] == 1]['Glucose']), 
         len(dataset[dataset['Outcome'] == 0]['Glucose'])])
axes[8].set_title('Diabetes vs No Diabetes')
axes[8].set_ylabel('Number')

plt.tight_layout()  # Adjust spacing between subplots
plt.show()



In [None]:
# Boxplots, best way to check for outliers

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten() 

for i, label in enumerate(dataset.columns[:-1]):
    ax = axes[i]
    ax.boxplot([dataset[dataset['Outcome'] == 1][label], dataset[dataset['Outcome'] == 0][label]], tick_labels=['Diabetes', 'No Diabetes'])
    ax.set_title(label)

plt.tight_layout()  # Adjust spacing between subplots
plt.show()

- Some features contain 0s that are really missing values and need to be replaced. Because there are many outliers, the mean would be distorted while the median would not, so we use the median as the replacement value. The mean works better for roughly normal distributions with few outliers.
- Certain features have a few outliers, which need to be handled properly.
   - **Tukey's Rule**: Also called the Interquartile Range (IQR) rule. It defines the boundaries beyond which a data point is treated as an outlier.
         IQR = (Q3 - Q1)
         Lower Boundary = (Q1 - 1.5 * IQR)
         Upper Boundary = (Q3 + 1.5 * IQR). Multipliers other than 1.5, such as 2 or 3, can also be used.
   - **Drop the outliers**: Self-explanatory. It can be done when there are a lot of rows, but it is not recommended, as valuable records may be lost. Use it only when you know that the data point is an incorrect reading, or when the outliers do not belong to the population under study, for example a child's record in a study of adults.
   - **Winsorize Method**: Limit outliers by setting upper and lower limits. Especially useful for medical data, where extreme but valid cases exist.
   - **Log Transformation**: Mostly used on highly right-skewed data. It reduces the skewness of the data and brings it closer to a normal distribution, which neural networks handle well. However, a log transformation may make the data more skewed in some cases, so be careful when using it.
- Divide the data into 3 sets: training, validation, and test.
- The dataset also seems to be imbalanced, with many more No Diabetes cases than Diabetes cases. Oversampling needs to be done, but only on the training set.
- Each feature has a different range, so the data needs to be scaled, with the scaler fitted on the training set.
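
As a rough illustration of Tukey's Rule and of why the median is preferred over the mean when outliers are present, here is a minimal sketch. The helper name `tukey_bounds` and the sample values are made up for this example and are not taken from the dataset.

```python
import numpy as np

def tukey_bounds(values, k=1.5):
    """Return the (lower, upper) outlier boundaries from Tukey's IQR rule."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample with one extreme reading (assumption, not real data)
sample = np.array([10, 12, 13, 14, 15, 16, 18, 95])

lower, upper = tukey_bounds(sample)
outliers = sample[(sample < lower) | (sample > upper)]

# The single extreme value pulls the mean far more than the median
sample_mean = sample.mean()
sample_median = np.median(sample)

print("Boundaries:", lower, upper)
print("Outliers:", outliers)
print("Mean:", sample_mean, "Median:", sample_median)
```

Passing a larger `k` (for example 3) widens the boundaries and flags only the most extreme points.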

## Data Preprocessing

### Replacing Missing Values

In [None]:
# Replacing missing values (recorded as 0s) in columns where 0 is not a valid reading

missing = ['Glucose', 'BloodPressure', 'Insulin', 'BMI', 'SkinThickness']

for col in missing:
    # Take the median of the non-zero readings only, so the 0 placeholders do not drag it down
    col_median = dataset.loc[dataset[col] != 0, col].median()
    dataset[col] = dataset[col].replace(0, col_median)

In [None]:
# Boxplots again, the best way to check for outliers
# They should also show that no 0 placeholders are left in the imputed columns.

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten() 

for i, label in enumerate(dataset.columns[:-1]):
    ax = axes[i]
    ax.boxplot([dataset[dataset['Outcome'] == 1][label], dataset[dataset['Outcome'] == 0][label]], tick_labels=['Diabetes', 'No Diabetes'])
    ax.set_title(label)

plt.tight_layout()  # Adjust spacing between subplots
plt.show()

### Handling Outliers

Taking care of outliers:

1. Pregnancies: **Keep as is**. The distribution is right-skewed, but that is expected: most women have a low number of pregnancies.
2. Glucose: There is a strong chance that the outliers here show hyperglycemia in diabetic patients. We want to preserve this information without letting it have extreme influence, which can be done by **Winsorization**. A log transform would compress the high values and lose some of that information.
3. BloodPressure: The outliers may be measurement errors. **Winsorize** both ends.
4. SkinThickness: Right-skewed. Use **Log Transform**.
5. Insulin: Heavily right-skewed, so a **Log Transform** fits. Note that the cell below currently winsorizes Insulin instead and applies the log transform only to SkinThickness.
6. BMI: Moderately skewed. Use **Winsorization**.
7. DiabetesPedigreeFunction: **Winsorization**. This function contains genetic risk information, where extreme values may represent legitimate hereditary factors. Mild winsorization preserves most of the information while reducing the influence of extreme outliers.
8. Age: **Keep as is**. It cannot be changed.
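
To give a feel for how the two techniques differ, here is a small sketch on synthetic right-skewed data. The helpers `winsorize_by_quantile` and `skewness` and the log-normal sample are assumptions made for illustration; the 5%/95% limits simply mirror the quantiles used in this notebook.

```python
import numpy as np

def winsorize_by_quantile(values, lower_q=0.05, upper_q=0.95):
    """Clip values to the given lower and upper quantiles."""
    lo, hi = np.quantile(values, [lower_q, upper_q])
    return np.clip(values, lo, hi)

def skewness(values):
    """Skewness as the third standardized moment (population form)."""
    centered = values - values.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

# Synthetic right-skewed data (assumption, not taken from the dataset)
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

winsorized = winsorize_by_quantile(skewed)  # caps the tails, keeps the original scale
logged = np.log1p(skewed)                   # rescales every value, compressing large ones

print("Skewness (original):  ", skewness(skewed))
print("Skewness (winsorized):", skewness(winsorized))
print("Skewness (log1p):     ", skewness(logged))
```

Winsorization only changes the values beyond the limits, whereas the log transform changes the scale of every value, which is why it suits heavily skewed features better than mildly skewed ones.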

In [None]:
# Applying Outlier Handling Techniques

# Winsorization
for label in ['Glucose', 'BloodPressure', 'BMI', 'DiabetesPedigreeFunction', 'Insulin']:
    lower = dataset[label].quantile(0.05)
    upper = dataset[label].quantile(0.95)
    dataset[label] = dataset[label].clip(lower, upper)

# Log Transform
# np.log1p is undefined for values <= -1, and negative values can be introduced by MICE imputation,
# so clip such columns to a positive lower bound before transforming them.

# dataset['Insulin'] = dataset['Insulin'].clip(lower=2)

for label in ['SkinThickness']:
    dataset[label] = np.log1p(dataset[label])

In [None]:
# Boxplots, best way to check for outliers

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.flatten() 

for i, label in enumerate(dataset.columns[:-1]):
    ax = axes[i]
    ax.boxplot([dataset[dataset['Outcome'] == 1][label], dataset[dataset['Outcome'] == 0][label]], tick_labels=['Diabetes', 'No Diabetes'])
    ax.set_title(label)

plt.tight_layout()  # Adjust spacing between subplots
plt.show()

### Dataset Splitting

In [None]:
# Divide the dataset into features and targets

x = dataset.drop('Outcome', axis=1)
y = dataset['Outcome']

x_train, x_new, y_train, y_new = train_test_split(x, y, test_size = 0.3, random_state = 0)
x_valid, x_test, y_valid, y_test = train_test_split(x_new, y_new, test_size = 0.5, random_state = 0)

temp = pd.concat([x_train, y_train], axis = 1)
plt.bar(['Diabetes', 'No Diabetes'], 
        [len(temp[temp['Outcome'] == 1]['Glucose']), 
         len(temp[temp['Outcome'] == 0]['Glucose'])])
plt.title('Diabetes vs No Diabetes')
plt.ylabel('Number')

plt.show()

### Oversampling the training set

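Oversampling or undersampling is applied only to the training data. Resampling before the split would let copies of the same rows land in both the training and the evaluation sets, which is a form of data leakage. Thus, the data is divided first and then sampled.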
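
Conceptually, random oversampling just duplicates randomly chosen minority-class rows until the classes are balanced. The sketch below illustrates that idea in plain NumPy; `random_oversample` is a hypothetical helper written for this example and is not how imblearn implements `RandomOverSampler`.

```python
import numpy as np

def random_oversample(x, y, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]  # start with every original row
    for cls, count in zip(classes, counts):
        if count < target:
            candidates = np.flatnonzero(y == cls)
            # Sample with replacement, since we need more rows than exist
            keep.append(rng.choice(candidates, size=target - count, replace=True))
    idx = np.concatenate(keep)
    return x[idx], y[idx]

# Toy training split with 7 negatives and 3 positives (made-up data)
x_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

x_bal, y_bal = random_oversample(x_toy, y_toy)
print(np.bincount(y_bal))
```

Because the added rows are exact copies, applying this before the split could place the same row in both the training and the test set, which is the leakage described above.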

In [None]:
# Oversampling the training set

sampler = RandomOverSampler(sampling_strategy = 1)
x_train, y_train = sampler.fit_resample(x_train, y_train)

y_train.value_counts()

In [None]:
temp = pd.concat([x_train, y_train], axis = 1)
plt.bar(['Diabetes', 'No Diabetes'], 
        [len(temp[temp['Outcome'] == 1]['Glucose']), 
         len(temp[temp['Outcome'] == 0]['Glucose'])])
plt.title('Diabetes vs No Diabetes')
plt.ylabel('Number')

### Feature Scaling

- Standardization is a common requirement for many ML algorithms as they start behaving badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.
- In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

1. StandardScaler: x_new = (x_old - mean) / (standard deviation). The result has zero mean and unit variance, which suits neural network architectures. This method is sensitive to outliers, as both the mean and the standard deviation are affected by them.
2. RobustScaler: x_new = (x_old - median) / (IQR). The median and the IQR are barely affected by outliers, so the scaling of the bulk of the data stays stable. If you have outliers that might affect the results and you don't want to remove them, use this.
3. MinMaxScaler: x_new = (x_old - min_value) / (max_value - min_value). It maps the data to the range 0 to 1. Not suitable when outliers are present, as the min and max values define the scale.
4. MaxAbsScaler: x_new = x_old / max(|x|). It maps the data to the range -1 to 1 (0 to 1 if there are no negative values). Since the maximum absolute value is used, it is not suitable when outliers are present.

We initially used a StandardScaler here. **UPDATE**: Changed to RobustScaler, as it gives noticeably better performance than StandardScaler.
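
The four formulas above can be compared on a tiny made-up feature containing one outlier. This sketch applies the formulas directly in NumPy rather than calling the scikit-learn classes, and the values in `x` are an assumption chosen for illustration.

```python
import numpy as np

# Made-up feature with one large outlier (not taken from the dataset)
x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# StandardScaler formula: zero mean, unit variance
standard = (x - x.mean()) / x.std()

# RobustScaler formula: centre on the median, scale by the IQR
q1, median, q3 = np.percentile(x, [25, 50, 75])
robust = (x - median) / (q3 - q1)

# MinMaxScaler formula: map to the range 0 to 1
min_max = (x - x.min()) / (x.max() - x.min())

# MaxAbsScaler formula: divide by the largest absolute value
max_abs = x / np.abs(x).max()

for name, scaled in [("standard", standard), ("robust", robust),
                     ("min_max", min_max), ("max_abs", max_abs)]:
    print(f"{name:>8}: {np.round(scaled, 3)}")
```

With the outlier present, the first four values are squeezed into a very narrow band by every formula except the robust one, which keeps them evenly spread around zero.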

In [None]:
# Feature Scaling

scaler = RobustScaler()
x_train = scaler.fit_transform(x_train)

In [None]:
x_valid = scaler.transform(x_valid)
x_test = scaler.transform(x_test)

In [None]:
print(x_train)

In [None]:
print(x_valid)

In [None]:
print(x_test)

## Model Creation and Training

In [None]:
# Model Creation

model = tf.keras.Sequential([
    tf.keras.layers.Dense(units=35, activation='relu', kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dense(units=25, activation='relu', kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dense(units=15, activation='relu', kernel_regularizer=tf.keras.regularizers.L2(0.01)),
    tf.keras.layers.Dense(units=1, activation='sigmoid')
])

In [None]:
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
             loss = tf.keras.losses.BinaryCrossentropy(),
             metrics=['accuracy'])

In [None]:
# Evaluate initial performance without any training

model.evaluate(x_train, y_train)

### Important Points

- **Iterations**: Number of batches needed to complete one Epoch.
- **Batch Size**: Number of training samples used in one Iteration.
- **Epoch**: One full cycle through the entire dataset.
- Number of steps per Epoch = Number of training examples / Batch Size
- After every Iteration, the gradients are computed on that batch and the weights are updated. Batching keeps the computation efficient.

Key Considerations to Keep in mind:

- Too Many Epochs can lead to overfitting.
- Smaller batch sizes introduce more noise but allow for more frequent updates. Larger batch sizes may require more memory.
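
The relation between these terms can be worked through with a quick calculation. The training-set size below is a hypothetical number chosen for illustration, while the batch size and epoch count match the values passed to `model.fit` in this notebook. The sketch assumes the last, partial batch is still processed, which is why the division is rounded up.

```python
import math

n_samples = 700   # hypothetical training-set size, not taken from the notebook
batch_size = 64   # same batch_size as in model.fit
epochs = 25       # same epochs as in model.fit

# One step (iteration) per batch; a partial final batch still counts as a step
steps_per_epoch = math.ceil(n_samples / batch_size)
last_batch_size = n_samples - (steps_per_epoch - 1) * batch_size
total_updates = steps_per_epoch * epochs

print("Steps per epoch:", steps_per_epoch)
print("Size of last batch:", last_batch_size)
print("Total weight updates:", total_updates)
```

Halving the batch size roughly doubles the number of weight updates per epoch, which is the trade-off between noisier and more frequent updates mentioned above.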

In [None]:
# Train the model

model.fit(x_train, y_train, batch_size=64, epochs=25, validation_data=(x_valid, y_valid))

In [None]:
model.evaluate(x_test, y_test)


## Interim Results

- We have achieved an accuracy of 84.59% on our test data using the simple neural network above.
- However, studies suggest that an accuracy of up to 98% can be achieved. Simple neural networks typically reach 80-85% accuracy, while deep neural networks can achieve 88-98%.

Some improvements include:

1. Better Data Preprocessing
   1. Changing the scaler from StandardScaler to RobustScaler boosted accuracy by 2%.
   2. Advanced imputation techniques for better handling of missing values. MICE helps preserve the relationships between variables better than blindly replacing with the median value.
   3. Advanced class balancing. Study ADASYN.
2. Hyperparameter Tuning
3. More Layers (Deep Neural Network)
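
As a starting point for the MICE idea, here is a minimal sketch using scikit-learn's `IterativeImputer` on synthetic data with two correlated columns. The data and the variable names (`features`, `with_gaps`) are assumptions made for this example. In this notebook the 0 placeholders would first have to be converted to `NaN`, and the imputer would presumably be fitted on the training split only.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic data: column b depends strongly on column a (assumption, not real data)
rng = np.random.default_rng(0)
a = rng.normal(50, 10, size=200)
b = 2 * a + rng.normal(0, 1, size=200)
features = np.column_stack([a, b])

# Hide the first 40 values of column b to simulate missing readings
missing_mask = np.zeros(200, dtype=bool)
missing_mask[:40] = True
with_gaps = features.copy()
with_gaps[missing_mask, 1] = np.nan

# MICE-style imputation: each feature is modelled from the others
imputer = IterativeImputer(max_iter=10, random_state=0)
imputed = imputer.fit_transform(with_gaps)

# Baseline: fill every gap with the column median
median_filled = with_gaps.copy()
median_filled[missing_mask, 1] = np.nanmedian(with_gaps[:, 1])

true_values = features[missing_mask, 1]
mice_error = np.abs(imputed[missing_mask, 1] - true_values).mean()
median_error = np.abs(median_filled[missing_mask, 1] - true_values).mean()

print("Mean absolute error (iterative):", mice_error)
print("Mean absolute error (median):   ", median_error)
```

On data like this, where one column largely determines the other, the iterative imputer recovers the hidden values far more closely than a single median; how much it helps on the diabetes features would need to be checked.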