# Diabetes Prediction
### Data Source
From Kaggle, by user Mohammed Mustafa.
Link: https://www.kaggle.com/datasets/iammustafatz/diabetes-prediction-dataset
### Description
The data contains predictors `gender`, `age`, `hypertension`, `heart_disease`, `smoking_history`, `bmi`, `HbA1c_level`, `blood_glucose_level`, and the response `diabetes` for 100000 patients.
- `gender` may have one of two qualitative values `male` or `female`
- `age` may be any integer value from 0-80
- `hypertension` is a binary indicator where 0 and 1 denote the absence and presence of hypertension in the patient, respectively
- `heart_disease` is a binary indicator where 0 and 1 denote the absence and presence of heart disease in the patient, respectively
- `smoking_history` may be one of six qualitative values, `not current`, `former`, `No Info`, `current`, `never`, and `ever`
- `bmi` is the body mass index of the patient and may be any real number between 10 and 95.7
- `HbA1c_level` is the level of hemoglobin A1c measured in the patient's blood and may be a real number from 3.5 to 9
- `blood_glucose_level` is the level of glucose in the patient's bloodstream and may be a real number from 80 to 300
- `diabetes` is a binary indicator where 0 and 1 denote the absence and presence of diabetes in the patient, respectively
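
The value ranges listed above can be turned into a quick sanity check on individual records. The following is only a minimal sketch: the helper `check_row` and the example record are hypothetical and not part of the dataset or the notebook, and the categorical comparison is made case-insensitive on the assumption that the capitalization in the file may differ from the description.

```python
# Allowed values and ranges, copied from the description above
CATEGORICAL = {
    "gender": {"male", "female"},
    "smoking_history": {"not current", "former", "no info", "current", "never", "ever"},
}
BINARY = ("hypertension", "heart_disease", "diabetes")
NUMERIC_RANGES = {
    "age": (0, 80),
    "bmi": (10, 95.7),
    "HbA1c_level": (3.5, 9),
    "blood_glucose_level": (80, 300),
}


def check_row(row):
    """Return the names of columns whose value falls outside the documented range."""
    problems = []
    for col, allowed in CATEGORICAL.items():
        # Case-insensitive comparison (assumption: file capitalization may vary)
        if str(row[col]).lower() not in allowed:
            problems.append(col)
    for col in BINARY:
        if row[col] not in (0, 1):
            problems.append(col)
    for col, (low, high) in NUMERIC_RANGES.items():
        if not low <= row[col] <= high:
            problems.append(col)
    return problems


# Hypothetical record with made-up values, for illustration only
example_row = {
    "gender": "Female", "age": 54, "hypertension": 0, "heart_disease": 0,
    "smoking_history": "never", "bmi": 27.3, "HbA1c_level": 6.1,
    "blood_glucose_level": 140, "diabetes": 0,
}
print(check_row(example_row))
```

Applied to every row of the loaded data, a check like this would flag records that do not match the description before any model is trained.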

### Training and Test Split
The data will be randomly split into training and test data. 10% of the data will be removed exclusively for model testing and analysis toward the end of the experiment. The remaining 90% of data will be used for training purposes. 

## Centralized and Federated Environments
### Centralized
In this scenario, all training data is stored in one location and used to train a single logistic regression model. This model will be analyzed to provide a baseline against which the federated learning experiments are compared.

### Federated
In this scenario, the training data is randomly distributed into 8 smaller sets, which are used to train 8 individual logistic regression models concurrently. This simulates 8 different clients training on their own local data. Initially, the bins will contain roughly the same amount of data. To explore the effects of an imbalanced distribution, another experiment will assess model training when the data is not equally distributed across the clients. Between training rounds, the clients send their model parameters (in this case the model coefficients) to a central process acting as a server. The server aggregates these updates into a new global model, which every client uses to update its own local model before the next round of training.
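
The two mechanics described above, distributing rows across clients and aggregating coefficients on the server, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the notebook's implementation: the function names `partition_rows` and `aggregate`, the skewed weights, and the choice of a sample-size-weighted average (in the style of federated averaging) are all assumptions, since the text does not specify how the server combines the updates.

```python
import random


def partition_rows(n_rows, n_clients=8, weights=None, seed=0):
    """Randomly assign row indices to clients.

    weights=None gives roughly equal bins; otherwise bin sizes are
    proportional to the given (hypothetical) weights.
    """
    rng = random.Random(seed)
    rows = list(range(n_rows))
    rng.shuffle(rows)
    if weights is None:
        weights = [1] * n_clients
    total = sum(weights)
    # Cumulative cut points, so every row lands in exactly one bin
    bounds = [0]
    running = 0
    for w in weights:
        running += w
        bounds.append(round(n_rows * running / total))
    return [rows[bounds[i]:bounds[i + 1]] for i in range(n_clients)]


def aggregate(client_coefs, client_sizes):
    """Average coefficient vectors, weighting each client by its sample count.

    Assumption: a FedAvg-style weighted mean; other aggregation rules are possible.
    """
    total = sum(client_sizes)
    n_params = len(client_coefs[0])
    return [
        sum(coefs[j] * size for coefs, size in zip(client_coefs, client_sizes)) / total
        for j in range(n_params)
    ]


# Balanced and imbalanced partitions of 90,000 training rows
balanced = partition_rows(90_000)
skewed = partition_rows(90_000, weights=[8, 4, 2, 2, 1, 1, 1, 1])
print([len(b) for b in balanced])
print([len(b) for b in skewed])

# Two toy clients: the larger client pulls the global coefficients toward its own
print(aggregate([[1.0, 0.0], [3.0, 2.0]], [1, 3]))
```

Weighting by sample count means that, in the imbalanced experiment, clients holding more data have proportionally more influence on the global model; an unweighted mean would treat all clients equally regardless of bin size.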


## Installing Dependencies

In [None]:
#!pip install numpy pandas scikit-learn

## Initial Data Processing

In [1]:
import os

# Change directory to the location holding the data
os.chdir('../data')
os.getcwd()

'C:\\Repositories\\COMP8590-StatisticalLearning\\data'

In [3]:
import random
from math import ceil
import pandas as pd
import sys

random.seed(0)
# Split the data, 90% Training, 10% Test
# Test data only used for final evaluation

# Count lines in file (the header row is included in this count)
with open('diabetes_prediction_dataset.csv') as data_file:
    lines_data = sum(1 for line in data_file)

# Randomly select train and test samples (row 0 is the header, so data rows start at 1)
select_test = random.sample(range(1, lines_data), ceil(lines_data * 0.1))
test_rows = set(select_test)  # A set gives fast membership checks
select_train = [row for row in range(1, lines_data) if row not in test_rows]

# Load the training and test data, skipping the rows that belong to the other split
train_df = pd.read_csv('diabetes_prediction_dataset.csv', skiprows=select_test)
test_df = pd.read_csv('diabetes_prediction_dataset.csv', skiprows=select_train)

# For consistency later and to avoid repeating this step, we will save the training and test data to separate files
train_df.to_csv('diabetes_prediction_dataset_train.csv', index=False)
test_df.to_csv('diabetes_prediction_dataset_test.csv', index=False) # Note, this will be reserved exclusively for model analysis AFTER training

# Centralized Logistic Regression
## P

In [None]:
from sklearn import linear_model
import numpy

