## Practical missing value imputation techniques for time series

In this notebook, we explore various missing value imputation techniques for time series data using the PhysioNet Challenge 2012 dataset. The objective of this workshop is to introduce different imputation methods and demonstrate their implementations using open-source libraries.

For the PhysioNet 2012 dataset, we compare several advanced imputation techniques with a naive baseline, focusing on their impact on machine learning performance. Specifically, we train a machine learning model to perform a downstream prediction task and analyze the differences in performance between the naive baseline and the versions with imputed data.

The primary library used in this workshop is **PyPOTS**, a versatile framework that incorporates state-of-the-art models for time series imputation and continues to expand.

### Install and load necessary libraries

In [1]:
!pip install pypots
!pip install tsdb
!pip install torch_geometric torch_scatter torch_sparse


In [None]:
import torch

!pip uninstall torch-scatter torch-sparse torch-geometric torch-cluster -y
!pip install torch-scatter -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install torch-sparse -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install torch-cluster -f https://data.pyg.org/whl/torch-{torch.__version__}.html
!pip install git+https://github.com/pyg-team/pytorch_geometric.git

In [1291]:
import pypots
import tsdb
from pypots.optim import Adam
from pypots.classification import Raindrop
from pypots.utils.metrics import calc_binary_classification_metrics
from pypots.utils.random import set_random_seed
import pandas as pd
import numpy as np
import urllib.request
import io

In [1295]:
def load_numpy_from_url(url):
	"""
	Load a numpy array from a URL.

	Parameters:
		url (str): The URL to load the numpy array from.

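	Returns:
		np.ndarray: The numpy array.
	"""
	# Fetch the raw bytes; the context manager closes the connection afterwards
	with urllib.request.urlopen(url) as response:
		payload = response.read()

	# Wrap the bytes in an in-memory buffer so np.load can read them like a file
	return np.load(io.BytesIO(payload))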
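
The helper relies on `np.load` accepting any file-like object, not only a path. The offline sketch below illustrates that round trip with a small made-up array standing in for the downloaded payload, so no network access is assumed.

```python
import io

import numpy as np

# A small array standing in for the remote .npy payload (illustrative data only)
original = np.arange(6, dtype=np.float32).reshape(2, 3)

# Serialize to bytes in memory, as the bytes would arrive from a server
buffer = io.BytesIO()
np.save(buffer, original)
payload = buffer.getvalue()

# Same step the helper performs on the downloaded bytes
restored = np.load(io.BytesIO(payload))

print(restored.shape, restored.dtype)
```

Shape and dtype survive the round trip because the `.npy` format stores them in its header.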

### Data Cleaning

In [1169]:
data = tsdb.load("physionet_2012")
feature = pd.concat([data["set-a"], data["set-b"], data["set-c"]], axis=0)
label = pd.concat([data["outcomes-a"], data["outcomes-b"], data["outcomes-c"]], axis=0)

2024-09-05 06:48:14 [INFO]: You're using dataset physionet_2012, please cite it properly in your work. You can find its reference information at the below link: 
https://github.com/WenjieDu/TSDB/tree/main/dataset_profiles/physionet_2012
2024-09-05 06:48:14 [INFO]: Dataset physionet_2012 has already been downloaded. Processing directly...
2024-09-05 06:48:14 [INFO]: Dataset physionet_2012 has already been cached. Loading from cache directly...


2024-09-05 06:48:14 [INFO]: Loaded successfully!


In [1164]:
n_steps = 48
missing_threshold = 0.7

We retain records that contain exactly 48 time steps (48 hours of data) and in which fewer than 70% of the values are missing.

In [1165]:
filtered_feature = (
    feature.groupby("RecordID")
    .filter(
        lambda group: len(group) == n_steps  # 48 hours of data
        and (group.isnull().sum().sum() / group.size)
        < missing_threshold  # Missing values < missing_threshold
    )
    .dropna(axis=1, how="all")  # Drop columns where all elements are NaN
    .sort_values(by=["RecordID", "Time"])
    .reset_index(drop=True)
)
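
To see what this filter keeps and drops, here is a minimal sketch on a made-up table. The names `toy`, `toy_steps`, and `toy_threshold` are hypothetical, and the threshold is deliberately lower than the notebook's 0.7 so that a three-row example can show a rejection.

```python
import numpy as np
import pandas as pd

toy_steps = 3        # hypothetical stand-in for n_steps
toy_threshold = 0.3  # hypothetical, lower than 0.7 so the tiny example rejects a record

toy = pd.DataFrame(
    {
        "RecordID": [1, 1, 1, 2, 2, 3, 3, 3],
        "Time": [0, 1, 2, 0, 1, 0, 1, 2],
        "HR": [80.0, np.nan, 82.0, 75.0, 76.0, np.nan, np.nan, np.nan],
        "Temp": [36.6, 36.7, np.nan, 37.0, 37.1, np.nan, np.nan, 38.0],
    }
)

# Same predicate as above: full-length record AND missing ratio below the threshold
kept = toy.groupby("RecordID").filter(
    lambda g: len(g) == toy_steps
    and (g.isnull().sum().sum() / g.size) < toy_threshold
)

# Record 2 is too short, record 3 is too sparse, record 1 survives
print(kept["RecordID"].unique().tolist())
```

Note that `g.size` counts every cell in the group, including identifier and time columns that are never missing, so the ratio is somewhat lower than the missing rate of the clinical variables alone.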

### Imputation

In [1166]:
def dataframe_to_3d_array(df_feature, df_label, groupby_col):
    """
    Converts a DataFrame into a 3D NumPy array for features and a 1D NumPy array for labels.

    Parameters:
    df_feature: 	pandas.DataFrame, the feature DataFrame containing data to be converted into a 3D array.
    df_label: 	 	pandas.Series, the label Series containing the labels, indexed by `groupby_col` values.
    groupby_col:	str, the column in `df_feature` used to group the data.

    Returns:
    feature_3d: 	np.ndarray, A 3D NumPy array where each "slice" (first dimension) corresponds to one group of features
    label_1d : 		np.ndarray, A 1D NumPy array containing the labels corresponding to each group in the same order
    """

    grouped = df_feature.groupby(groupby_col)

    groupby_ids, arrays = zip(
        *[
            (groupby_id, group.drop(columns=["RecordID"]).values)
            for groupby_id, group in grouped
        ]
    )

    feature_3d = np.stack(arrays)
    label_1d = df_label.loc[list(groupby_ids)].to_numpy().flatten()

    return feature_3d, label_1d
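
The sketch below repeats the same grouping and stacking steps on a tiny made-up table to show the resulting shapes. The column name `outcome` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Two records with two time steps each (illustrative values)
toy_feature = pd.DataFrame(
    {
        "RecordID": [10, 10, 20, 20],
        "Time": [0, 1, 0, 1],
        "HR": [80.0, 81.0, 70.0, 71.0],
    }
)
# Labels indexed by RecordID; "outcome" is a hypothetical column name
toy_label = pd.DataFrame(
    {"outcome": [0, 1]}, index=pd.Index([10, 20], name="RecordID")
)

# One 2D array (time steps x columns) per record, in group order
ids, arrays = zip(
    *[
        (rid, g.drop(columns=["RecordID"]).values)
        for rid, g in toy_feature.groupby("RecordID")
    ]
)

toy_3d = np.stack(arrays)  # (n_records, n_steps, n_columns)
toy_1d = toy_label.loc[list(ids)].to_numpy().flatten()  # labels in the same order

print(toy_3d.shape, toy_1d.shape)
```

`np.stack` only works because every record has the same number of rows, which the earlier filtering step guarantees. Only `RecordID` is dropped, so the `Time` column stays in the array as one of the features.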

#### Mean Imputation

In [1167]:
# TODO: Calculate the overall mean for each column across the entire dataset
overall_means = ...

# Group by 'RecordID' and fill NaN values with the record's own column mean where possible
# Where a column is entirely NaN within a record, fall back to the overall column mean
mean_imputed_feature = (
    filtered_feature.groupby("RecordID")
    .apply(
        lambda group: group.fillna(group.mean())  # Fill with the record's mean
        .fillna(overall_means)  # Fill remaining NaNs with the overall mean
    )
    .reset_index(drop=True)  # Drop the 'RecordID' index
    .sort_values(by=["RecordID", "Time"])
    .reset_index(drop=True)
)
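
The two-stage fallback is easiest to check on a toy table. The sketch below is one possible way to express it, not necessarily the intended solution to the exercise: it swaps `groupby.apply` for `groupby.transform("mean")`, and the names `toy`, `value_cols`, and `toy_overall_means` are hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "RecordID": [1, 1, 2, 2],
        "HR": [80.0, np.nan, 60.0, 70.0],
        "Temp": [np.nan, np.nan, 37.0, 38.0],  # record 1 has no Temp at all
    }
)
value_cols = ["HR", "Temp"]

# Overall column means across all records (an assumed choice for the fallback)
toy_overall_means = toy[value_cols].mean()

# Stage 1: per-record means, broadcast back to the original rows
group_means = toy.groupby("RecordID")[value_cols].transform("mean")

# Stage 2: fall back to the overall mean where the record mean is itself NaN
toy_imputed = toy.copy()
toy_imputed[value_cols] = (
    toy[value_cols].fillna(group_means).fillna(toy_overall_means)
)

print(toy_imputed)
```

The missing `HR` value in record 1 takes that record's own mean (80.0), while its `Temp` values, which have no record-level mean, take the overall mean (37.5).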



In [1170]:
mean_imputed, mean_label = dataframe_to_3d_array(
    df_feature=mean_imputed_feature, df_label=label, groupby_col="RecordID"
)

#### Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation (CSDI)

Although autoregressive models are commonly used for time series imputation, score-based diffusion models have recently surpassed them in performance across various tasks, including image generation and audio synthesis. Given these advancements, it’s worth exploring their potential for time series imputation as well.

In [1171]:
filtered_feature_3d, filtered_label = dataframe_to_3d_array(
    df_feature=filtered_feature, df_label=label, groupby_col="RecordID"
)

# TODO: fill in the corresponding datasets
dataset_for_imputation = {
    "X": ...,
    "y": ...,
}

n_features = dataset_for_imputation["X"].shape[-1]

In [1175]:
from pypots.imputation import CSDI

set_random_seed(42)

# initialize the model
csdi = CSDI(
    n_features=n_features,
    n_steps=1,
    n_layers=3,
    n_heads=2,
    n_channels=128,
    d_time_embedding=64,
    d_feature_embedding=32,
    d_diffusion_embedding=128,
    target_strategy="random",
    n_diffusion_steps=50,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # Set the optimizer. Unlike torch.optim.Optimizer, pypots.optim.Optimizer does not need the model's parameters
    # at initialization. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # num_workers is passed to torch.utils.data.DataLoader: the number of subprocesses used for data loading.
    # The default of 0 loads data in the main process, i.e. without subprocesses.
    # Increase it if data loading is a bottleneck for training speed.
    num_workers=0,
    # Leave it at the default of None and PyPOTS will automatically pick the best available device.
    # Set it to 'cpu' if you have no CUDA devices, to 'cuda:0' or 'cuda:1' to pick one of several CUDA devices,
    # or to a list such as ['cuda:0', 'cuda:1'] for parallel training.
    device=None,
    # Save only the best model once training has finished.
    # Set it to "better" to save every model that improves on the previous best during training.
    model_saving_strategy="best",
)


# train the model; no validation set is passed here, so the best model is selected by training loss
csdi.fit(train_set=dataset_for_imputation)

2024-09-05 07:06:16 [INFO]: Have set the random seed as 42 for numpy and pytorch.
2024-09-05 07:06:16 [INFO]: No given device, using default device: cpu
2024-09-05 07:06:16 [INFO]: Model files will be saved to ./csdi/20240905_T070616
2024-09-05 07:06:16 [INFO]: Tensorboard file will be saved to ./csdi/20240905_T070616/tensorboard
2024-09-05 07:06:16 [INFO]: CSDI initialized with the given hyperparameters, the number of trainable parameters: 873,153
2024-09-05 07:08:31 [INFO]: Epoch 001 - training loss: 0.8558
2024-09-05 07:08:31 [INFO]: Saved the model to ./csdi/20240905_T070616/CSDI_epoch1_loss0.8558165751970731.pypots
2024-09-05 07:10:53 [INFO]: Epoch 002 - training loss: 0.7794
2024-09-05 07:10:54 [INFO]: Saved the model to ./csdi/20240905_T070616/CSDI_epoch2_loss0.7794200258377271.pypots
2024-09-05 07:13:05 [INFO]: Epoch 003 - training loss: 0.7343
2024-09-05 07:13:06 [INFO]: Saved the model to ./csdi/20240905_T070616/CSDI_epoch3_loss0.7342884846222706.pypots
2024-09-05 07:15:17 [I

In [1176]:
csdi_imputed = csdi.predict(dataset_for_imputation)["imputation"].squeeze(1)
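
The `.squeeze(1)` above removes an extra axis from the prediction. The sketch below assumes, without confirmation from this notebook, that the CSDI output is laid out as `(n_samples, n_sampling_times, n_steps, n_features)` with one sample drawn by default; the random arrays are hypothetical stand-ins for the model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for csdi.predict(...)["imputation"] with a single draw.
# Assumed layout: (n_samples, n_sampling_times, n_steps, n_features)
fake_single = rng.normal(size=(4, 1, 48, 3))
squeezed = fake_single.squeeze(1)  # drop the length-1 sampling axis

# With several draws per sample, squeeze(1) would raise an error;
# one option is to reduce over the sampling axis instead, e.g. with the median
fake_multi = rng.normal(size=(4, 5, 48, 3))
point_estimate = np.median(fake_multi, axis=1)

print(squeezed.shape, point_estimate.shape)
```

Both results have the `(n_samples, n_steps, n_features)` layout that the downstream classifier expects.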

#### Bidirectional Recurrent Imputation for Time Series (BRITS)

In [1273]:
from pypots.imputation import BRITS

set_random_seed(42)

# initialize the model
brits = BRITS(
    n_steps=n_steps,
    n_features=n_features,
    rnn_hidden_size=128,
    batch_size=32,
    # here we set epochs=20 for a quick demo; you can set it to 100 or more for better performance
    epochs=20,
    # Set the optimizer. Unlike torch.optim.Optimizer, pypots.optim.Optimizer does not need the model's parameters
    # at initialization. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # num_workers is passed to torch.utils.data.DataLoader: the number of subprocesses used for data loading.
    # The default of 0 loads data in the main process, i.e. without subprocesses.
    # Increase it if data loading is a bottleneck for training speed.
    num_workers=0,
    # Leave it at the default of None and PyPOTS will automatically pick the best available device.
    # Set it to 'cpu' if you have no CUDA devices, to 'cuda:0' or 'cuda:1' to pick one of several CUDA devices,
    # or to a list such as ['cuda:0', 'cuda:1'] for parallel training.
    device=None,
    # Save only the best model once training has finished.
    # Set it to "better" to save every model that improves on the previous best during training.
    model_saving_strategy="best",
)

brits.fit(train_set=dataset_for_imputation)

2024-09-05 12:45:17 [INFO]: No given device, using default device: cpu
2024-09-05 12:45:17 [INFO]: BRITS initialized with the given hyperparameters, the number of trainable parameters: 255,344
2024-09-05 12:45:40 [INFO]: Epoch 001 - training loss: 142.2712
2024-09-05 12:45:58 [INFO]: Epoch 002 - training loss: 108.8362
2024-09-05 12:46:20 [INFO]: Epoch 003 - training loss: 86.8353
2024-09-05 12:46:38 [INFO]: Epoch 004 - training loss: 77.1769
2024-09-05 12:46:56 [INFO]: Epoch 005 - training loss: 72.4962
2024-09-05 12:47:14 [INFO]: Epoch 006 - training loss: 69.0645
2024-09-05 12:47:33 [INFO]: Epoch 007 - training loss: 66.2519
2024-09-05 12:47:50 [INFO]: Epoch 008 - training loss: 63.5382
2024-09-05 12:48:11 [INFO]: Epoch 009 - training loss: 61.2091
2024-09-05 12:48:28 [INFO]: Epoch 010 - training loss: 59.0752
2024-09-05 12:48:46 [INFO]: Epoch 011 - training loss: 57.2260
2024-09-05 12:49:03 [INFO]: Epoch 012 - training loss: 55.3654
2024-09-05 12:49:22 [INFO]: Epoch 013 - training 

In [1275]:
brits_imputed = brits.predict(dataset_for_imputation)["imputation"]

### Classification

In [1286]:
from sklearn.model_selection import train_test_split

def split_data(imputed_data, label_data, test_size=0.2, val_size=0.2, random_state=42):
    """
    Split the imputed data and label data into training, validation, and testing sets.

    Parameters:
    imputed_data: 	np.ndarray, the imputed data to be split.
    label_data: 	np.ndarray, the label data to be split.
    test_size: 		float, optional, the proportion of the dataset to include in the test split.
    val_size: 		float, optional, the proportion of the dataset to include in the validation split.
    random_state: 	int, optional, random seed for reproducibility.

    Returns:
    datasets: 		dict, a dictionary containing the training, validation, and testing sets.
    """

    train_val_data, test_data, train_val_label, test_label = train_test_split(
        imputed_data,  # The 3D array of features
        label_data,  # Corresponding labels
        test_size=test_size,  # Test data size
        random_state=random_state,
    )

    train_data, val_data, train_label, val_label = train_test_split(
        train_val_data,   # Remaining train+validation features
        train_val_label,  # Remaining train+validation labels
        test_size=val_size
        / (
            1 - test_size
        ),  # Adjust validation size relative to the train+validation set
        random_state=random_state,
    )

    datasets = {
        "training": {"X": train_data, "y": train_label},
        "validation": {"X": val_data, "y": val_label},
        "testing": {"X": test_data, "y": test_label},
    }

    return datasets
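
The second split rescales `val_size` because it is applied to what remains after the test set is removed: with the defaults, 0.2 / (1 - 0.2) = 0.25 of the remaining 80%, which is 20% of the full data. A minimal sketch with placeholder arrays (the shapes and labels are made up) shows the resulting sizes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 samples, 48 steps, 3 features (illustrative only)
X = np.zeros((100, 48, 3))
y = np.arange(100) % 2
test_size, val_size = 0.2, 0.2

# First split: hold out the test set
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=test_size, random_state=42
)

# Second split: validation fraction expressed relative to the remainder
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=val_size / (1 - test_size), random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The sizes come out at roughly 60/20/20. Passing `val_size` unchanged to the second split would instead give a validation set of only 16% of the full data.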

#### Evaluating Mean Imputation

In [1287]:
# Mean Imputation
datasets = split_data(mean_imputed, mean_label)

For time series classification, we use Raindrop, a graph neural network designed for irregularly sampled multivariate time series that learns time-varying dependencies between the variables.

In [1288]:
set_random_seed()

# initialize the model
raindrop = Raindrop(
    n_steps=n_steps,
    n_features=n_features,
    n_classes=2,
    n_layers=2,
    d_model=n_features * 4,
    d_ffn=256,
    n_heads=2,
    dropout=0.4,
    batch_size=32,
    # here we set epochs=20 for a quick demo; you can set it to 100 or more for better performance
    epochs=20,
    # here we set patience=5 to stop training early if the validation loss doesn't decrease for 5 epochs.
    # You can leave it at the default of None to disable early stopping.
    patience=5,
    # Set the optimizer. Unlike torch.optim.Optimizer, pypots.optim.Optimizer does not need the model's parameters
    # at initialization. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=5e-4),
    # num_workers is passed to torch.utils.data.DataLoader: the number of subprocesses used for data loading.
    # The default of 0 loads data in the main process, i.e. without subprocesses.
    # Increase it if data loading is a bottleneck for training speed.
    num_workers=0,
    # Leave it at the default of None and PyPOTS will automatically pick the best available device.
    # Set it to 'cpu' if you have no CUDA devices, to 'cuda:0' or 'cuda:1' to pick one of several CUDA devices,
    # or to a list such as ['cuda:0', 'cuda:1'] for parallel training.
    device=None,
    model_saving_strategy="best",  # save only the best model once training has finished.
    # Set it to "better" to save every model that improves on the previous best during training.
)

2024-09-05 12:58:21 [INFO]: Have set the random seed as 2022 for numpy and pytorch.
2024-09-05 12:58:21 [INFO]: No given device, using default device: cpu
2024-09-05 12:58:21 [INFO]: Raindrop initialized with the given hyperparameters, the number of trainable parameters: 1,541,396


In [1289]:
# Fit and predict
raindrop.fit(train_set=datasets["training"], val_set=datasets["validation"])
pred_mean = raindrop.predict(datasets["testing"])["classification"]

2024-09-05 12:58:34 [INFO]: Epoch 001 - training loss: 0.5450, validation loss: 0.4540
2024-09-05 12:58:43 [INFO]: Epoch 002 - training loss: 0.5259, validation loss: 0.4183
2024-09-05 12:58:53 [INFO]: Epoch 003 - training loss: 0.5061, validation loss: 0.4000
2024-09-05 12:59:02 [INFO]: Epoch 004 - training loss: 0.4926, validation loss: 0.4027
2024-09-05 12:59:10 [INFO]: Epoch 005 - training loss: 0.4841, validation loss: 0.4018
2024-09-05 12:59:18 [INFO]: Epoch 006 - training loss: 0.4734, validation loss: 0.4156
2024-09-05 12:59:28 [INFO]: Epoch 007 - training loss: 0.4934, validation loss: 0.4175
2024-09-05 12:59:40 [INFO]: Epoch 008 - training loss: 0.4795, validation loss: 0.4288
2024-09-05 12:59:40 [INFO]: Exceeded the training patience. Terminating the training procedure...
2024-09-05 12:59:40 [INFO]: Finished training. The best model is from epoch#3.


In [1290]:
# calculate the values of binary classification metrics on the model's prediction
metrics = calc_binary_classification_metrics(pred_mean, datasets["testing"]["y"])
print(
    "Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'Accuracy: {metrics["accuracy"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

Testing classification metrics: 
ROC_AUC: 0.6620135363790186, 
Accuracy: 0.8,
F1: 0.19672131147540983,
Precision: 0.46153846153846156,
Recall: 0.125,
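
An accuracy of 0.8 next to a recall of 0.125 suggests a strongly imbalanced test set. The sketch below uses one set of confusion-matrix counts that is consistent with the printed metrics; these counts are inferred from the numbers above and are not read from the model output.

```python
# Counts inferred from the printed metrics (an assumption, not notebook output)
tp, fp, fn, tn = 6, 7, 42, 190
total = tp + fp + fn + tn

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a trivial classifier that always predicts the negative class
majority_baseline = (tn + fp) / total

print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
print(f"accuracy={accuracy:.4f} majority_baseline={majority_baseline:.4f}")
```

Under these inferred counts, always predicting the negative class would score about 0.804 accuracy, slightly above the model's 0.8, which is why ROC AUC, F1, and recall are more informative here than accuracy.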



#### Evaluating CSDI Imputation

In [1302]:
# CSDI Imputation

csdi_imputed = load_numpy_from_url(url='https://raw.githubusercontent.com/Alvorecer721/Missing_value_imputation_workshop/d870767db751380283661f62cbc3439f683433ed/data/time-series/csdi_imputed.npy')
label = load_numpy_from_url(url='https://raw.githubusercontent.com/Alvorecer721/Missing_value_imputation_workshop/d870767db751380283661f62cbc3439f683433ed/data/time-series/label.npy')
datasets = split_data(csdi_imputed, label)

In [1252]:
set_random_seed()

# initialize the model
raindrop = Raindrop(
    n_steps=n_steps,
    n_features=n_features,
    n_classes=2,
    n_layers=2,
    d_model=n_features * 4,
    d_ffn=256,
    n_heads=2,
    dropout=0.4,
    batch_size=32,
    # here we set epochs=20 for a quick demo; you can set it to 100 or more for better performance
    epochs=20,
    # here we set patience=5 to stop training early if the validation loss doesn't decrease for 5 epochs.
    # You can leave it at the default of None to disable early stopping.
    patience=5,
    # Set the optimizer. Unlike torch.optim.Optimizer, pypots.optim.Optimizer does not need the model's parameters
    # at initialization. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=5e-4),
    # num_workers is passed to torch.utils.data.DataLoader: the number of subprocesses used for data loading.
    # The default of 0 loads data in the main process, i.e. without subprocesses.
    # Increase it if data loading is a bottleneck for training speed.
    num_workers=0,
    # Leave it at the default of None and PyPOTS will automatically pick the best available device.
    # Set it to 'cpu' if you have no CUDA devices, to 'cuda:0' or 'cuda:1' to pick one of several CUDA devices,
    # or to a list such as ['cuda:0', 'cuda:1'] for parallel training.
    device=None,
    model_saving_strategy="best",  # save only the best model once training has finished.
    # Set it to "better" to save every model that improves on the previous best during training.
)

2024-09-05 11:45:12 [INFO]: Have set the random seed as 2022 for numpy and pytorch.
2024-09-05 11:45:12 [INFO]: No given device, using default device: cpu
2024-09-05 11:45:12 [INFO]: Model files will be saved to tutorial_results/classification/raindrop/20240905_T114512
2024-09-05 11:45:12 [INFO]: Tensorboard file will be saved to tutorial_results/classification/raindrop/20240905_T114512/tensorboard
2024-09-05 11:45:12 [INFO]: Raindrop initialized with the given hyperparameters, the number of trainable parameters: 1,541,396


In [1253]:
# Fit and predict
raindrop.fit(train_set=datasets["training"], val_set=datasets["validation"])

pred_csdi = raindrop.predict(datasets["testing"])["classification"]

2024-09-05 11:45:22 [INFO]: Epoch 001 - training loss: 0.5450, validation loss: 0.4540
2024-09-05 11:45:22 [INFO]: Saved the model to tutorial_results/classification/raindrop/20240905_T114512/Raindrop_epoch1_loss0.4539904296398163.pypots
2024-09-05 11:45:30 [INFO]: Epoch 002 - training loss: 0.5259, validation loss: 0.4183
2024-09-05 11:45:30 [INFO]: Saved the model to tutorial_results/classification/raindrop/20240905_T114512/Raindrop_epoch2_loss0.41831769049167633.pypots
2024-09-05 11:45:40 [INFO]: Epoch 003 - training loss: 0.5061, validation loss: 0.4000
2024-09-05 11:45:40 [INFO]: Saved the model to tutorial_results/classification/raindrop/20240905_T114512/Raindrop_epoch3_loss0.39998289197683334.pypots
2024-09-05 11:45:48 [INFO]: Epoch 004 - training loss: 0.4926, validation loss: 0.4027
2024-09-05 11:45:56 [INFO]: Epoch 005 - training loss: 0.4841, validation loss: 0.4018
2024-09-05 11:46:04 [INFO]: Epoch 006 - training loss: 0.4734, validation loss: 0.4156
2024-09-05 11:46:13 [IN

In [1266]:
# calculate the values of binary classification metrics on the model's prediction
metrics = calc_binary_classification_metrics(pred_csdi, datasets["testing"]["y"])
print(
    "Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'Accuracy: {metrics["accuracy"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

Testing classification metrics: 
ROC_AUC: 0.7125634517766498, 
Accuracy: 0.8040816326530612,
F1: 0.4,
Precision: 0.5,
Recall: 0.3333333333333333,



#### Evaluating BRITS Imputation

In [1301]:
# BRITS Imputation
brits_imputed = load_numpy_from_url(url='https://raw.githubusercontent.com/Alvorecer721/Missing_value_imputation_workshop/d870767db751380283661f62cbc3439f683433ed/data/time-series/brits_imputed.npy')
label = load_numpy_from_url(url='https://raw.githubusercontent.com/Alvorecer721/Missing_value_imputation_workshop/d870767db751380283661f62cbc3439f683433ed/data/time-series/label.npy')
datasets = split_data(brits_imputed, label)

In [1280]:
set_random_seed()

# initialize the model
raindrop = Raindrop(
    n_steps=n_steps,
    n_features=n_features,
    n_classes=2,
    n_layers=2,
    d_model=n_features * 4,
    d_ffn=256,
    n_heads=2,
    dropout=0.4,
    batch_size=32,
    # here we set epochs=20 for a quick demo; you can set it to 100 or more for better performance
    epochs=20,
    # here we set patience=5 to stop training early if the validation loss doesn't decrease for 5 epochs.
    # You can leave it at the default of None to disable early stopping.
    patience=5,
    # Set the optimizer. Unlike torch.optim.Optimizer, pypots.optim.Optimizer does not need the model's parameters
    # at initialization. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=5e-4),
    # num_workers is passed to torch.utils.data.DataLoader: the number of subprocesses used for data loading.
    # The default of 0 loads data in the main process, i.e. without subprocesses.
    # Increase it if data loading is a bottleneck for training speed.
    num_workers=0,
    # Leave it at the default of None and PyPOTS will automatically pick the best available device.
    # Set it to 'cpu' if you have no CUDA devices, to 'cuda:0' or 'cuda:1' to pick one of several CUDA devices,
    # or to a list such as ['cuda:0', 'cuda:1'] for parallel training.
    device=None,
    model_saving_strategy="best",  # save only the best model once training has finished.
    # Set it to "better" to save every model that improves on the previous best during training.
)

2024-09-05 12:53:57 [INFO]: Have set the random seed as 2022 for numpy and pytorch.
2024-09-05 12:53:57 [INFO]: No given device, using default device: cpu
2024-09-05 12:53:57 [INFO]: Raindrop initialized with the given hyperparameters, the number of trainable parameters: 1,541,396


In [1281]:
# Fit and predict
raindrop.fit(train_set=datasets["training"], val_set=datasets["validation"])

pred_brits = raindrop.predict(datasets["testing"])["classification"]

2024-09-05 12:54:25 [INFO]: Epoch 001 - training loss: 0.5352, validation loss: 0.4740
2024-09-05 12:54:34 [INFO]: Epoch 002 - training loss: 0.5304, validation loss: 0.4389
2024-09-05 12:54:43 [INFO]: Epoch 003 - training loss: 0.5050, validation loss: 0.4072
2024-09-05 12:54:52 [INFO]: Epoch 004 - training loss: 0.4922, validation loss: 0.3959
2024-09-05 12:55:00 [INFO]: Epoch 005 - training loss: 0.4923, validation loss: 0.4118
2024-09-05 12:55:09 [INFO]: Epoch 006 - training loss: 0.4835, validation loss: 0.4187
2024-09-05 12:55:18 [INFO]: Epoch 007 - training loss: 0.4897, validation loss: 0.4231
2024-09-05 12:55:28 [INFO]: Epoch 008 - training loss: 0.4855, validation loss: 0.3985
2024-09-05 12:55:37 [INFO]: Epoch 009 - training loss: 0.4604, validation loss: 0.3956
2024-09-05 12:55:50 [INFO]: Epoch 010 - training loss: 0.4489, validation loss: 0.4006
2024-09-05 12:55:59 [INFO]: Epoch 011 - training loss: 0.4534, validation loss: 0.4268
2024-09-05 12:56:08 [INFO]: Epoch 012 - tra

In [1282]:
# calculate the values of binary classification metrics on the model's prediction
metrics = calc_binary_classification_metrics(pred_brits, datasets["testing"]["y"])
print(
    "Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'Accuracy: {metrics["accuracy"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

Testing classification metrics: 
ROC_AUC: 0.7115059221658206, 
Accuracy: 0.7877551020408163,
F1: 0.3157894736842105,
Precision: 0.42857142857142855,
Recall: 0.25,
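
To compare the three runs side by side, the sketch below collects the test metrics printed above (rounded to four decimals) into one table and ranks them by ROC AUC. The variable names are hypothetical, and the numbers come from a single run per method, so small gaps should be read with caution.

```python
import pandas as pd

# Test metrics copied from the outputs above, rounded to four decimals
results = pd.DataFrame(
    {
        "roc_auc": [0.6620, 0.7126, 0.7115],
        "accuracy": [0.8000, 0.8041, 0.7878],
        "f1": [0.1967, 0.4000, 0.3158],
        "precision": [0.4615, 0.5000, 0.4286],
        "recall": [0.1250, 0.3333, 0.2500],
    },
    index=["mean", "CSDI", "BRITS"],
)

# Rank the imputation methods by ROC AUC, best first
ranked = results.sort_values("roc_auc", ascending=False)
print(ranked)
```

In this run both learned imputers improve ROC AUC, F1, and recall over mean imputation, while accuracy barely moves.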



## References

You can find the algorithm papers below:
* [Raindrop: Graph-guided Network for Irregularly Sampled Multivariate Time Series](https://arxiv.org/pdf/2110.05357)
* [CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation](https://arxiv.org/pdf/2107.03502)
* [BRITS: Bidirectional Recurrent Imputation for Time Series](https://papers.nips.cc/paper_files/paper/2018/file/734e6bfcd358e25ac1db0a4241b95651-Paper.pdf)

And tools:
* [PyPOTS](https://github.com/WenjieDu/PyPOTS)