# Diabetes Prediction
#### Data Source
From Kaggle by user Mohammed Mustafa 
Link https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
### Description
The data contains predictors `gender`, `age`, `hypertension`, `heart_disease`, `smoking_history`, `bmi`, `HbA1c_level`, `blood_glucose_level`, and the response `diabetes` for 100000 patients.
- `gender` may have one of three qualitative values `female`, `male`, or `other`
- `age` may be any integer value from 0-80
- `hypertension` is one-hot encoded where 0 and 1 are the absense and presence of hypertension exhibited by the patient, respectively
- `heart_disease` is one-hot encoded where 0 and 1 are the absense and presence of heart disease exhibited by the patient, respectively
- `smoking_history` may be one of six qualitative values, `not current`, `former`, `No Info`, `current`, `never`, and `ever`
- `bmi` is the body mass index of the patient and may be any real number between 10 and 95.7
- `HbA1c_level` is the level of hemoglobic A1c measured in the pateints blood and may be a real number from 3.5 to 9
- `blood_glucose_level` is the level of glucose in the patient's bloodstream and may be a real nmumber from 80 to 300
- `diabetes` is one-hot encoded where 0 and 1 are the absense and presence of diabetes exhibited by the patient, respectively

### Training and Test Split
The data will be randomly split into training and test data. 10% of the data will be removed exclusively for model testing and analysis toward the end of the experiment. The remaining 90% of data will be used for training purposes. 

## Centralized and Federated Environments
### Centralized
In this scenario, all data is stored in one location and training using a single logistic regression model. This model will be analyzed and provide a baseline against which the federated learning tests will be compared.

### Federated
In this scenario, the training data is randomly distributed into 8 different smaller sets which will be used to train 8 invididual logistic regression models concurrently. This will simulate 8 different clients training on their own local data. Initially, the bins will contain roughly the same amount of data. To explore the effects of an imbalanced distribution of data, another experiment will assess model training when this data is not equally distributed across the clients. It is important to note that between training rounds, the clients will send model parameters (in this case the model coefficients) to a central process acting as a server. This server will aggregate model updates to generate a new global model which all clients will use to update their own local model before the next round of training.


## Installing Dependencies

In [1]:
#!pip install numpy pandas sklearn

## Initial Data Processing

In [2]:
import os

# Change directory to the location holding the data
os.chdir('../data')
os.getcwd()

'C:\\Repositories\\COMP8590-StatisticalLearning\\data'

In [3]:
import random
from math import ceil
import pandas as pd
import sys

random.seed(0)
# Split the data, 90% Training, 10% Test
# Test data only used for final evaluation

# Count lines in file
data_file = open('diabetes_prediction_dataset.csv')
lines_data = sum(1 for line in data_file)

# Reset file pointer
data_file.seek(0)

# Randomly Select Train and test Samples
select_test = random.sample(range(1, lines_data), ceil(lines_data * 0.1))
select_train = [row for row in range(1, lines_data) if row not in select_test]

# Load in test data
train_df = pd.read_csv('diabetes_prediction_dataset.csv', skiprows=select_test)
test_df = pd.read_csv('diabetes_prediction_dataset.csv', skiprows=select_train)

# As an additional preprocessing task, we need to create dummy variables for the `smoking_history` predictor.
# This is necessary so that we can create 6 dummy dummy variables that will be one-hot encoded for each of the
# possible categorical values of `smoking_history`. The new dummy variables are `smoking_history_No Info`,
# `smoking_history_current`, `smoking_history_ever`, `smoking_history_former`, `smoking_history_never`, and
# `smoking_history_not current`.
smk_dummies_train = pd.get_dummies(train_df['smoking_history'], dtype=int)
train_df = train_df.drop(columns=['smoking_history'])
for column in smk_dummies_train.columns:
        train_df.insert(7, 'smoking_history_' + column, smk_dummies_train[column])

smk_dummies_test = pd.get_dummies(test_df['smoking_history'], dtype=int)
test_df = test_df.drop(columns=['smoking_history'])
for column in smk_dummies_test.columns:
        test_df.insert(7, 'smoking_history_' + column, smk_dummies_test[column])
        
# The same process must also be performed for the `gender` predictor. We will create three dummy variables, 
# `female`, `male`, and `other`, and one-hot encode them in a similar fashion to above
gnd_dummies_train = pd.get_dummies(train_df['gender'], dtype=int)
train_df = train_df.drop(columns=['gender'])
for column in gnd_dummies_train.columns:
        train_df.insert(0, 'gender_' + column, gnd_dummies_train[column])

gnd_dummies_test = pd.get_dummies(test_df['gender'], dtype=int)
test_df = test_df.drop(columns=['gender'])
for column in gnd_dummies_test.columns:
        test_df.insert(0, 'gender_' + column, gnd_dummies_test[column])

# Confirm data added correctly
# print(train_df.shape)
# print(train_df)
# print(test_df.shape)
# print(test_df)

# For consistency later and to avoid repeating this step, we will save the training and test data to separate files
train_df.to_csv('diabetes_prediction_dataset_train.csv', index=False)
test_df.to_csv('diabetes_prediction_dataset_test.csv', index=False) # Note, this will be reserved exclusively for model analysis AFTER training

# Centralized Logistic Regression
## Scenario
In this scenario, we create a logistic regression model that is trained to predict the reponse `diabetes` using the predictors `gender`, `age`, `hypertension`, `heart_disease`, `smoking_history`, `bmi`, `HbA1c_level`, and `blood_glucose_level`. We assume that all training data and the training task itself is centralized to one system. Here, we train the model using k-fold cross-validation.

### Implementation
#### Data Processing
We begin by importing the necessary packages and processing the training data. Here we require the training data to be split again into training and validation data. Here, the traininig set is relatively large (~90000 samples) so we implement k-fold cross-validation where k=10. For this task, scikit-learn fortunately has a built-in function that will train the logistic regession model with cross validation.

In [4]:
# imports 
from sklearn.linear_model import LogisticRegressionCV
import numpy

# Data Preparation, get training data, isolate into predictors and response columns
cent_train = pd.read_csv('diabetes_prediction_dataset_train.csv')
X_train = cent_train.loc[:, cent_train.columns != 'diabetes'].to_numpy()
Y_train = cent_train.loc[:, cent_train.columns == 'diabetes'].to_numpy().ravel()

# Confirm shape should be cent_train: (n, 16), X_train: (n, 15), and Y_train (n,) where n is ~90000
print('Size:\n\tcent_train:\t{}\n\tX_train:\t{}\n\tY_train:\t{}'.format(cent_train.shape, X_train.shape, Y_train.shape))

Size:
	cent_train:	(89999, 16)
	X_train:	(89999, 15)
	Y_train:	(89999,)


#### Model Definition and Training
Here, the logistic regression model is defined. `LogisticRegressionCV` has a `cv` parameter which governs the number of folds in the cross-validation method. Because we are performing 10-fold cross-validation, we set `cv=10`. Finally, we tell select to use the Stochastic Average Gradient decent algorithm as per scikit's recommendations since it runs faster for larger datasets. Therefore, we set `solver='sag'`. SAG is also only compatible with the L2 penalty term so we need to set the `penalty` parameter to `penalty='l2'`. `random_state` is another parameter which should be set when using SAG. We set `random_state=0`. We also set the `max_iter` parameter to `10000` which halts learning when the model converges or when the maximum number of training iterations/epochs has been reached; whichever comes first. The `n_jobs` parameter is set to 5 to enable multithreaded training on 5 CPU cores. This speeds up model training by simultaneously training the model with 5 of the 10 folds. `verbose=1` simply allows us to view the training progress. 

In [5]:
# Model Definition
cModel_LR = LogisticRegressionCV(cv=10, solver='sag', max_iter=10000, random_state=0, penalty='l2', n_jobs=5, verbose=1)

In [6]:
# Model Training
cModel_LR.fit(X_train, Y_train)

[Parallel(n_jobs=5)]: Using backend ThreadingBackend with 5 concurrent workers.


convergence after 948 epochs took 70 seconds
convergence after 947 epochs took 73 seconds
convergence after 947 epochs took 73 seconds
convergence after 947 epochs took 74 seconds
convergence after 947 epochs took 75 seconds
convergence after 939 epochs took 64 seconds
convergence after 938 epochs took 67 seconds
convergence after 938 epochs took 68 seconds
convergence after 938 epochs took 71 seconds
convergence after 941 epochs took 70 seconds
convergence after 1060 epochs took 72 seconds
convergence after 1068 epochs took 74 seconds
convergence after 1066 epochs took 75 seconds
convergence after 1056 epochs took 77 seconds
convergence after 1067 epochs took 78 seconds
convergence after 748 epochs took 48 seconds
convergence after 751 epochs took 52 seconds
convergence after 751 epochs took 52 seconds
convergence after 747 epochs took 53 seconds
convergence after 756 epochs took 53 seconds
convergence after 1092 epochs took 72 seconds
convergence after 1101 epochs took 75 seconds
con

[Parallel(n_jobs=5)]: Done  10 out of  10 | elapsed: 13.2min finished


#### Centralized Model Evaluation


In [7]:
cent_test = pd.read_csv('diabetes_prediction_dataset_test.csv')
X_test = cent_train.loc[:, cent_train.columns != 'diabetes'].to_numpy()
Y_test = cent_train.loc[:, cent_train.columns == 'diabetes'].to_numpy().ravel()
y_pred = cModel_LR.predict(X_test)

In [14]:
from sklearn import metrics
from sklearn.metrics import classification_report

print(classification_report(Y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98     82374
           1       0.87      0.62      0.73      7625

    accuracy                           0.96     89999
   macro avg       0.92      0.81      0.85     89999
weighted avg       0.96      0.96      0.96     89999



### DELETE ME BUT GOOD REFERENCES 
https://www.datacamp.com/tutorial/understanding-logistic-regression-python
https://www.section.io/engineering-education/how-to-implement-k-fold-cross-validation/
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_numpy.html
https://www.google.com/search?client=firefox-b-d&q=the+max_iter+was+reached+which+means
https://www.google.com/search?client=firefox-b-d&q=can+sklearn+use+gpu
https://www.google.com/search?q=scikit+learn+replace+model+coefficients&client=firefox-b-d&sxsrf=APwXEdeNYOhw-LS79iqhPKrhPg5uXq_VXg%3A1682610796889&ei=bJpKZM7uNbOcptQP5eCW6AE&oq=scikit+learn+replace+model+coe&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAxgAMgUIIRCgAToKCAAQRxDWBBCwAzoECCMQJzoHCAAQigUQQzoICAAQigUQkQI6BQgAEIAEOgoIABCABBAUEIcCOgYIABAWEB46CAgAEBYQHhAPOggIABAWEB4QCjoICCEQFhAeEB06BAghEBVKBAhBGABQzRxYmHJgiIABaAFwAXgAgAHjAYgBiBCSAQY3LjEwLjGYAQCgAQHIAQjAAQE&sclient=gws-wiz-serp
https://stackoverflow.com/questions/24438779/creating-a-sklearn-linear-model-logisticregression-instance-from-existing-coeffi
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV
https://www.google.com/search?q=sklearn+logistic+regression&client=firefox-b-d&sxsrf=APwXEde385EBa7GBKCN58NQH7JRKZx_gcQ%3A1682602212713&ei=5HhKZI-dK5upptQP7ZiZ-Ac&oq=sklearn+&gs_lcp=Cgxnd3Mtd2l6LXNlcnAQAxgBMgQIIxAnMgcIABCKBRBDMgcIABCKBRBDMgcIABCKBRBDMgcIABCKBRBDMgcIABCKBRBDMgcIABCKBRBDMggIABCKBRCRAjIHCAAQigUQQzIICAAQigUQkQI6CggAEEcQ1gQQsAM6CggAEIoFELEDEENKBAhBGABQ4BJYzh1gqS5oAnABeACAAXaIAfAFkgEDNC40mAEAoAEByAEIwAEB&sclient=gws-wiz-serp#ip=1