# Federated MTGNN for Blood Glucose Prediction on the OhioT1DM Dataset

## Introduction

In this report, we predict blood glucose levels with two models, Bi-LSTM and MTGNN, implemented in PyTorch and trained with a Federated Learning approach. We describe the design of the models, compare their performance, and visualize the results.



## Dataset


In this section, we visualize the dataset and examine summary statistics of its features. We also apply two imputation methods, cubic spline interpolation and KNN, and compare their outputs.

Within the **utils.py** file, there's a function called **get_dataset**. This function manages the loading of the dataset by applying the chosen imputation method. It also converts the data into tensors to ready it for the training stage.
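The tensor conversion step could look roughly like the sketch below. It is not the actual `get_dataset` code: the function name `make_windows`, the parameters `history_len` and `horizon`, and the sliding-window layout are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import torch


def make_windows(df, target_col="cbg", history_len=12, horizon=6):
    """Slice an imputed frame into (history, target) tensor pairs.

    Hypothetical helper: each input is `history_len` consecutive rows, and the
    target is the `target_col` value `horizon` steps after the window ends.
    """
    values = df.to_numpy(dtype=np.float32)
    target_idx = df.columns.get_loc(target_col)
    n_windows = len(values) - history_len - horizon + 1
    if n_windows <= 0:
        raise ValueError("series is shorter than history_len + horizon")
    inputs, targets = [], []
    for start in range(n_windows):
        end = start + history_len
        inputs.append(values[start:end])
        targets.append(values[end + horizon - 1, target_idx])
    x = torch.from_numpy(np.stack(inputs))                       # (windows, time, features)
    y = torch.from_numpy(np.asarray(targets, dtype=np.float32))  # (windows,)
    return x, y
```

Windows like these should be built per patient, so that no window spans two different recordings.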

In [9]:
import os
from collections import defaultdict
import numpy as np 
import pandas as pd
from scipy import interpolate
from sklearn.impute import KNNImputer
import matplotlib.pyplot as plt
import seaborn as sns


def KNN_interpolate(df, k=5):
    imputer = KNNImputer(n_neighbors=k, keep_empty_features=True)
    imputed_df = imputer.fit_transform(df)
    imputed_df = pd.DataFrame(imputed_df, columns=df.columns, index=df.index)
    return imputed_df

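def cubic_interpolate(df):
    """Fill missing values column by column with a cubic spline over the row index."""
    df_np = df.values
    missing_mask = np.isnan(df_np)
    # Columns without enough observations to fit a spline are left as zeros.
    df_impute_np = np.zeros(df_np.shape)
    x_range = np.arange(df_np.shape[0])
    for col_index in range(df_np.shape[1]):
        not_missing_indexes = np.where(~missing_mask[:, col_index])[0]
        # CubicSpline needs at least two observed points.
        if len(not_missing_indexes) < 2:
            continue
        cs_impute = interpolate.CubicSpline(not_missing_indexes, df_np[not_missing_indexes, col_index])
        # Evaluate on every row; the spline passes through the observed points.
        df_impute_np[:, col_index] = cs_impute(x_range)
    return pd.DataFrame(df_impute_np, columns=df.columns, index=df.index)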

In [2]:
ohio_directory = 'data/Ohio Data/'
ohio_folders = []
data_dict = defaultdict(dict)
for folder in os.listdir(ohio_directory):
    if folder.startswith("Ohio"):
        ohio_folders.append(os.path.join(ohio_directory, folder))
for folder in ohio_folders:
    train_dir = os.path.join(folder, "train")
    for file in os.listdir(train_dir):
        data_dict[file.split("-")[0]]["train"] = pd.read_csv(os.path.join(train_dir, file), index_col=0)
        
all_train_data = []
all_train_data_imputed_cubic = []
all_train_data_imputed_knn = []
for patient in data_dict.keys():
    train_df = data_dict[patient]["train"]
    train_impute_df_cubic = cubic_interpolate(train_df)
    train_impute_df_knn = KNN_interpolate(train_df)
    
    all_train_data.append(train_df)
    all_train_data_imputed_cubic.append(train_impute_df_cubic)
    all_train_data_imputed_knn.append(train_impute_df_knn)

all_train_data = pd.concat(all_train_data, ignore_index=True)
all_train_data_imputed_cubic = pd.concat(all_train_data_imputed_cubic, ignore_index=True)
all_train_data_imputed_knn = pd.concat(all_train_data_imputed_knn, ignore_index=True)

### Looking at Data

Several features have many missing values. For example, **hr** is recorded in fewer than half of the rows (70,757 of 153,039) and is entirely absent in some samples. We plan to standardize the features with min-max normalization, but the missing values must be handled first because our models cannot process them directly.
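As a minimal sketch of the min-max normalization mentioned above: the helper name `min_max_normalize` is hypothetical, and reusing the training minima and maxima for other splits is an assumption, not something the report states.

```python
import pandas as pd


def min_max_normalize(df, col_min=None, col_max=None):
    """Scale each column to [0, 1] using (x - min) / (max - min).

    Pass the minima and maxima computed on the training split when
    normalizing other data, so that all splits share one scale.
    """
    col_min = df.min() if col_min is None else col_min
    col_max = df.max() if col_max is None else col_max
    # Constant columns would divide by zero; map them to 0 instead.
    span = (col_max - col_min).replace(0, 1.0)
    return (df - col_min) / span, col_min, col_max
```

Because the scale is set by the extreme values, a single out-of-range imputed value compresses all other values of that feature into a narrow band.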

In [3]:
all_train_data.describe()

| Statistic | missing_cbg | cbg | finger | basal | hr | gsr | carbInput | bolus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 153039.0 | 134788.0 | 3787.0 | 151749.0 | 70757.0 | 117882.0 | 1774.0 | 2946.0 |
| mean | 0.119257 | 158.948853 | 157.611566 | 1.000059 | 79.853682 | 0.85909 | 44.740699 | 5.989664 |
| std | 0.324092 | 60.717516 | 75.508538 | 0.417486 | 16.030767 | 3.392687 | 33.328517 | 4.382126 |
| min | 0.0 | 40.0 | 0.0 | 0.0 | 45.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 113.0 | 106.0 | 0.73 | 67.0 | 8.9e-05 | 20.0 | 2.7 |
| 50% | 0.0 | 151.0 | 151.0 | 0.98 | 79.0 | 0.010315 | 38.0 | 5.0 |
| 75% | 0.0 | 197.0 | 203.0 | 1.25 | 90.0 | 0.201129 | 60.0 | 8.5 |
| max | 1.0 | 400.0 | 586.0 | 2.34 | 189.0 | 75.074359 | 450.0 | 25.0 |


In [None]:
sns.boxplot(data=all_train_data)
plt.show()

### Imputation

Many papers use cubic spline interpolation for imputation. However, some features contain long runs of consecutive missing values, and across such gaps the spline can produce values far outside the valid range. The cubic interpolation statistics show this clearly: **cbg** ranges from about -1536 to 840, and **hr** reaches about 1.26e10. After min-max normalization, such extremes would compress all other values into a very narrow band.
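The overshoot can be reproduced on a toy series. The numbers below are made up for illustration and are not taken from OhioT1DM; the only point is that a spline fitted across a long gap is free to leave the range of the observed values.

```python
import numpy as np
from scipy import interpolate

# Eight observed readings with a 16-step gap between index 3 and index 20.
observed_idx = np.array([0, 1, 2, 3, 20, 21, 22, 23])
observed_cbg = np.array([100.0, 120.0, 150.0, 190.0, 110.0, 105.0, 102.0, 100.0])

spline = interpolate.CubicSpline(observed_idx, observed_cbg)
filled = spline(np.arange(24))

# The series is rising when the gap starts, so the spline keeps climbing
# past the largest observed reading before it turns back down.
print("observed range:", observed_cbg.min(), observed_cbg.max())
print("imputed range: ", filled.min(), filled.max())
```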

To address these issues, we use KNN imputation instead. Because each imputed value is an average of observed values from neighboring rows, it always stays within the observed range. The choice of distance metric matters, however, and KNN is slow: imputing this dataset takes approximately 4 minutes. The code therefore caches the imputation result in a temporary directory so that subsequent loads are fast.

In [11]:
all_train_data_imputed_cubic.describe()

| Statistic | missing_cbg | cbg | finger | basal | hr | gsr | carbInput | bolus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 |
| mean | 0.119257 | 156.663445 | 155.315233 | 0.996516 | 23109780.0 | -195.273406 | 67.930752 | 6.117347 |
| std | 0.324092 | 97.866838 | 224.419422 | 0.418827 | 408292600.0 | 6450.530074 | 919.306476 | 35.388577 |
| min | 0.0 | -1535.924661 | -16541.075015 | -0.148502 | -3.013359 | -562348.078719 | -11770.370029 | -2346.670438 |
| 25% | 0.0 | 110.0 | 109.354371 | 0.73 | 0.0 | 5.9e-05 | 19.800351 | 2.237723 |
| 50% | 0.0 | 151.0 | 154.614983 | 0.98 | 0.0 | 0.000331 | 42.164623 | 4.818995 |
| 75% | 0.0 | 199.0 | 209.264304 | 1.25 | 79.0 | 0.121148 | 69.061182 | 9.121388 |
| max | 1.0 | 840.037912 | 4715.500197 | 2.34 | 12649520000.0 | 1181.777602 | 35342.11488 | 3432.138461 |


In [12]:
all_train_data_imputed_knn.describe()

| Statistic | missing_cbg | cbg | finger | basal | hr | gsr | carbInput | bolus |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 | 153039.0 |
| mean | 0.119257 | 159.594166 | 159.698566 | 0.996999 | 39.595663 | 1.051105 | 48.105744 | 5.158674 |
| std | 0.324092 | 58.989273 | 47.369874 | 0.417799 | 41.52683 | 3.971518 | 24.832617 | 2.90026 |
| min | 0.0 | 40.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 116.0 | 128.2 | 0.73 | 0.0 | 0.000104 | 27.2 | 2.88 |
| 50% | 0.0 | 152.2 | 156.8 | 0.98 | 0.0 | 0.032423 | 42.8 | 4.48 |
| 75% | 0.0 | 194.0 | 189.4 | 1.25 | 78.0 | 0.347168 | 61.2 | 6.66 |
| max | 1.0 | 400.0 | 586.0 | 2.34 | 189.0 | 75.074359 | 450.0 | 25.0 |


## Model Architectures

Note: Both models are defined in the **models.py** file.

### Model 1: Bi-LSTM

The Bi-LSTM model follows the paper that the review paper (see References) identified as achieving the best results. It uses a bidirectional LSTM whose outputs feed two fully connected layers. The figure below summarizes the model as described in the original paper.

<img src="assets/Bi-LSTM.png" width="30%"/>
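A minimal PyTorch sketch of this architecture follows. The class name `BiLSTMRegressor`, the layer sizes, the ReLU activation, and the single-value output are assumptions; the definition in **models.py** and the configuration in the original paper may differ.

```python
import torch
from torch import nn


class BiLSTMRegressor(nn.Module):
    """Bidirectional LSTM followed by two fully connected layers."""

    def __init__(self, n_features, hidden_size=64, fc_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden_size, fc_size)
        self.fc2 = nn.Linear(fc_size, 1)

    def forward(self, x):
        # x: (batch, time, features)
        _, (h_n, _) = self.lstm(x)
        # h_n[0] is the final forward state, h_n[1] the final backward state.
        summary = torch.cat([h_n[0], h_n[1]], dim=1)
        return self.fc2(torch.relu(self.fc1(summary))).squeeze(-1)
```

Calling `BiLSTMRegressor(n_features=7)(torch.randn(4, 12, 7))` returns one prediction per sequence in the batch.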

### Model 2: MTGNN

The second model is MTGNN, originally proposed for multivariate time series forecasting and evaluated on traffic data, among other benchmarks. The model learns an embedding for each node in the graph (in our case, each feature is a node). From these embeddings it computes pairwise similarity scores between nodes and builds a graph from them. Graph convolution over this graph is then used to predict future values. The temporal convolution module uses dilated inception layers, which enlarge the receptive field without requiring many convolution layers or very large kernels.

<img src="assets/MTGNN.png" width="50%"/>
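The two building blocks described above can be sketched as follows. This is a simplified illustration based on the MTGNN paper, not the code in **models.py**: the class names, default sizes, and the `top_k` value are assumptions, and the gating, residual connections, and graph convolution of the full model are omitted.

```python
import torch
from torch import nn


class GraphLearner(nn.Module):
    """Learns a sparse, directed adjacency matrix from node embeddings."""

    def __init__(self, n_nodes, emb_dim=16, top_k=3, alpha=3.0):
        super().__init__()
        self.n_nodes, self.top_k, self.alpha = n_nodes, top_k, alpha
        self.emb1 = nn.Embedding(n_nodes, emb_dim)
        self.emb2 = nn.Embedding(n_nodes, emb_dim)
        self.lin1 = nn.Linear(emb_dim, emb_dim)
        self.lin2 = nn.Linear(emb_dim, emb_dim)

    def forward(self):
        idx = torch.arange(self.n_nodes, device=self.emb1.weight.device)
        m1 = torch.tanh(self.alpha * self.lin1(self.emb1(idx)))
        m2 = torch.tanh(self.alpha * self.lin2(self.emb2(idx)))
        # Antisymmetric similarity: a positive score i -> j implies none for j -> i.
        scores = torch.relu(torch.tanh(self.alpha * (m1 @ m2.T - m2 @ m1.T)))
        # Keep only the top-k strongest neighbours of each node.
        keep = scores.topk(self.top_k, dim=1).indices
        mask = torch.zeros_like(scores).scatter_(1, keep, 1.0)
        return scores * mask


class DilatedInception(nn.Module):
    """Parallel dilated 1-D convolutions with several kernel widths."""

    def __init__(self, c_in, c_out, kernel_sizes=(2, 3, 6, 7), dilation=1):
        super().__init__()
        assert c_out % len(kernel_sizes) == 0
        branch_out = c_out // len(kernel_sizes)
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in, branch_out, kernel_size=(1, k), dilation=(1, dilation))
            for k in kernel_sizes
        )

    def forward(self, x):
        # x: (batch, channels, nodes, time)
        outs = [conv(x) for conv in self.convs]
        # Wider kernels yield shorter outputs; align all branches to the shortest.
        t = min(o.size(3) for o in outs)
        return torch.cat([o[..., -t:] for o in outs], dim=1)
```

Stacking such layers with growing dilation factors is what lets the receptive field expand quickly with depth.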

## Training Process

The training process begins in the **federated_main.py** file. It first sets up the global model. Then, in each training round, every client loads the current global model and trains it on its own local data.

After the local training, the clients' updated weights are aggregated, and the result replaces the previous global model so that it stays up to date.
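One round of this procedure can be sketched as below, assuming plain federated averaging (FedAvg) as the aggregation rule, which the text above does not name explicitly. The names `average_weights`, `run_round`, and `local_train` are hypothetical.

```python
import copy

import torch
from torch import nn


def average_weights(client_states, client_sizes=None):
    """Average client state dicts, optionally weighted by local dataset size."""
    if client_sizes is None:
        client_sizes = [1] * len(client_states)
    total = float(sum(client_sizes))
    averaged = copy.deepcopy(client_states[0])
    for name in averaged:
        mixed = sum(state[name].float() * (n / total)
                    for state, n in zip(client_states, client_sizes))
        averaged[name] = mixed.to(client_states[0][name].dtype)
    return averaged


def run_round(global_model, client_datasets, local_train):
    """Run one federated round and update `global_model` in place."""
    client_states, client_sizes = [], []
    for dataset in client_datasets:
        local_model = copy.deepcopy(global_model)  # start from the current global weights
        local_train(local_model, dataset)          # hypothetical client-side training routine
        client_states.append(local_model.state_dict())
        client_sizes.append(len(dataset))
    global_model.load_state_dict(average_weights(client_states, client_sizes))
    return global_model
```

Only model weights move between the clients and the server in this scheme; each patient's raw data stays local.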

Additionally, the **grid_search** function in this file divides the training data into training and validation sets. For each set of parameters, it runs the training process and saves the chosen parameters along with the results of the global model in separate CSV files.
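The general shape of such a search is sketched below. This is not the actual **grid_search** implementation: the signature, the chronological split, and the `train_and_evaluate` callback are assumptions, and the sketch writes a single CSV rather than separate files.

```python
import itertools

import pandas as pd


def grid_search(train_df, param_grid, train_and_evaluate, val_fraction=0.2, out_path=None):
    """Try every parameter combination and collect validation metrics.

    `train_and_evaluate(fit_df, val_df, **params)` is a hypothetical callback
    that runs training and returns a dict of metrics.
    """
    # Chronological split: the last rows form the validation set.
    n_val = int(round(len(train_df) * val_fraction))
    fit_df, val_df = train_df.iloc[:len(train_df) - n_val], train_df.iloc[len(train_df) - n_val:]
    names = list(param_grid)
    rows = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        metrics = train_and_evaluate(fit_df, val_df, **params)
        rows.append({**params, **metrics})
    results = pd.DataFrame(rows)
    if out_path is not None:
        results.to_csv(out_path, index=False)
    return results
```

Sorting the returned frame by the validation metric then gives the best parameter set.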

## Results

| Number of Clients | Model | TRAIN MAE | TRAIN MSE | TEST MAE | TEST MSE |
| --- | --- | --- | --- | --- | --- |
| 1 | MTGNN | 15.76 | 554.99 | 14.60 | 467.32 |
| 4 | MTGNN | 16.12 | 589.34 | 15.01 | 504.85 |
| 1 | BiLSTM | 15.85 | 546.25 | 15.03 | 477.69 |
| 4 | BiLSTM | 16.15 | 606.92 | 14.99 | 508.80 |


### Plotting Global model MAE on train and test data

#### MTGNN with 4 Clients
<img src="assets/loss_MTGNN_4.png" width="90%"/>

#### BiLSTM with 4 Clients
<img src="assets/loss_BiLSTM_4.png" width="90%"/>

## Future Work

- **Handling missing data in the model**: A model such as MTGODE, which models the continuous dynamics of the latent space, could handle missing data within the model itself. Neural ordinary differential equation (Neural ODE) networks might offer improvements in this regard.
- **Alternative similarity method for KNN**: The KNN imputation currently relies on simple Euclidean distance to measure similarity. Weighting the distance toward the more important features could improve the quality of the imputed data.
- **Plotting number of clients**: A graph showing how training and validation error change with the number of clients would be useful.

## References

The papers and GitHub repositories referenced in this implementation are as follows:

- Review paper: [A Critical Review of the state-of-the-art on Deep Neural Networks for Blood Glucose Prediction in Patients with Diabetes](https://arxiv.org/abs/2109.02178).
- Bi-LSTM paper: [Predicting Blood Glucose with an LSTM and Bi-LSTM Based Deep Neural Network](https://arxiv.org/abs/1809.03817).
- MTGNN paper: [Connecting the Dots: Multivariate Time Series Forecasting with Graph Neural Networks](https://arxiv.org/abs/2005.11650), [GitHub](https://github.com/nnzhan/MTGNN).
- Federated Learning GitHub repository: [GitHub](https://github.com/AshwinRJ/Federated-Learning-PyTorch/tree/master).
- FastTensorDataLoader GitHub repository: [GitHub](https://github.com/hcarlens/pytorch-tabular/blob/master/fast_tensor_data_loader.py).