<a href="https://colab.research.google.com/github/Balogunhabeeb14/Personal-Projects/blob/main/PyPOTS_tutorials.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 😎 Quick-start Tutorials for PyPOTS are Here!

## Dependency Installation

In [None]:
# install pypots >=0.1
! pip install pypots==0.1.1


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pypots==0.1.1
  Downloading pypots-0.1.1-py3-none-any.whl (149 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m150.0/150.0 kB[0m [31m4.3 MB/s[0m eta [36m0:00:00[0m
Collecting pycorruptor (from pypots==0.1.1)
  Downloading pycorruptor-0.0.4-py3-none-any.whl (17 kB)
Collecting tsdb (from pypots==0.1.1)
  Downloading tsdb-0.0.8-py3-none-any.whl (29 kB)
Installing collected packages: tsdb, pycorruptor, pypots
Successfully installed pycorruptor-0.0.4 pypots-0.1.1 tsdb-0.0.8


## 📀 Preparing the **PhysioNet-2012** dataset for this tutorial

In [None]:
from pypots.data.generating import gene_physionet2012
from pypots.utils.random import set_random_seed

set_random_seed()

# Load the PhysioNet-2012 dataset
physionet2012_dataset = gene_physionet2012(artificially_missing_rate=0.1)

# Take a look at the generated PhysioNet-2012 dataset. Everything has been prepared for you:
# data splitting, normalization, additional artificially-missing values for evaluation, etc.
print(physionet2012_dataset.keys())

2023-05-22 12:24:35 [INFO]: Done. Have already set the random seed as 2204 for numpy and pytorch.
2023-05-22 12:24:35 [INFO]: Loading the dataset physionet_2012 with TSDB (https://github.com/WenjieDu/Time_Series_Database)...
2023-05-22 12:24:35 [INFO]: Starting preprocessing physionet_2012...


Dataset physionet_2012 has already been downloaded. Processing directly...
Dataset physionet_2012 has already been cached. Loading from cache directly...
Loaded successfully!
dict_keys(['n_classes', 'n_steps', 'n_features', 'train_X', 'train_y', 'val_X', 'val_y', 'test_X', 'test_y', 'scaler', 'test_X_intact', 'test_X_indicating_mask', 'val_X_intact', 'val_X_indicating_mask'])
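
The `*_X_intact` and `*_X_indicating_mask` keys exist so that imputation can be scored against values that were really observed and then hidden on purpose. The plain-Python sketch below illustrates the idea on one toy series. `hold_out` is a hypothetical helper written for this note, not a PyPOTS or PyCorruptor function, and the real preprocessing may differ in details (for example in how the intact copy stores originally-missing values).

```python
import math
import random


def hold_out(series, rate, seed=0):
    """Hide a fraction `rate` of the observed values of `series`.

    Returns (corrupted, intact, indicating_mask). indicating_mask[i] is 1 only
    for values that were observed originally and hidden artificially.
    """
    rng = random.Random(seed)
    # positions that hold a real observation
    observed = [i for i, v in enumerate(series) if not math.isnan(v)]
    n_hidden = round(len(observed) * rate)
    hidden = set(rng.sample(observed, n_hidden))
    corrupted = [math.nan if i in hidden else v for i, v in enumerate(series)]
    indicating_mask = [1 if i in hidden else 0 for i in range(len(series))]
    return corrupted, list(series), indicating_mask


nan = math.nan
series = [0.5, nan, 1.2, 0.9, nan, 1.1, 0.7, 0.3, nan, 0.8, 1.0, 0.6, 0.4]
corrupted, intact, indicating_mask = hold_out(series, rate=0.1)
print(indicating_mask)
```

Only the positions flagged in `indicating_mask` have a known ground truth, which is why the evaluation cells later pass the mask to the metric.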


## 🌟 Imputation Models

In [None]:
# Assemble the datasets for training, validating, and testing.

dataset_for_training = {
    "X": physionet2012_dataset['train_X'],
}

dataset_for_validating = {
    "X": physionet2012_dataset['val_X'],
    "X_intact": physionet2012_dataset['val_X_intact'],
    "indicating_mask": physionet2012_dataset['val_X_indicating_mask'],
}

dataset_for_testing = {
    "X": physionet2012_dataset['test_X'],
}


### 🚀 An example of **SAITS** for imputation

In [None]:
from pypots.optim import Adam
from pypots.imputation import SAITS
from pypots.utils.metrics import cal_mae

# initialize the model
saits = SAITS(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    n_layers=2,
    d_model=256,
    d_inner=128,
    n_heads=4,
    d_k=64,
    d_v=64,
    dropout=0.1,
    attn_dropout=0.1,
    diagonal_attention_mask=True,  # otherwise the original self-attention mechanism will be applied
    # You can adjust ORT_weight and MIT_weight to make the SAITS model focus more on one task.
    # Usually you can just leave them at their default value, i.e. 1.
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validating loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/imputation/saits",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validating set to select the best model for testing in the next step
saits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, impute the originally-missing values and artificially-missing values in the test set
saits_imputation = saits.impute(dataset_for_testing)

# calculate mean absolute error on the ground truth (artificially-missing values)
testing_mae = cal_mae(
    saits_imputation, physionet2012_dataset['test_X_intact'], physionet2012_dataset['test_X_indicating_mask'])
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 17:35:28 [INFO]: No given device, using default device: cuda
2023-05-21 17:35:28 [INFO]: Model files will be saved to tutorial_results/imputation/saits/20230521_T173528
2023-05-21 17:35:28 [INFO]: Tensorboard file will be saved to tutorial_results/imputation/saits/20230521_T173528/tensorboard
2023-05-21 17:35:28 [INFO]: Model initialized successfully with the number of trainable parameters: 1,378,358
2023-05-21 17:35:35 [INFO]: epoch 0: training loss 0.7098, validating loss 0.3240
2023-05-21 17:35:41 [INFO]: epoch 1: training loss 0.5091, validating loss 0.2987
2023-05-21 17:35:48 [INFO]: epoch 2: training loss 0.4537, validating loss 0.2798
2023-05-21 17:35:55 [INFO]: epoch 3: training loss 0.4150, validating loss 0.2640
2023-05-21 17:36:01 [INFO]: epoch 4: training loss 0.3868, validating loss 0.2486
2023-05-21 17:36:08 [INFO]: epoch 5: training loss 0.3665, validating loss 0.2466
2023-05-21 17:36:16 [INFO]: epoch 6: training loss 0.3529, validating loss 0.2393
2023-05-21 

Testing mean absolute error: 0.2305
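
The reported score is a masked mean absolute error: only positions flagged by the indicating mask contribute. The sketch below shows that computation on flat Python lists. `masked_mae` is a hypothetical stand-in written for illustration; `cal_mae` itself works on arrays and may handle edge cases differently.

```python
def masked_mae(predictions, targets, mask):
    """Mean absolute error over the positions where mask == 1.

    Assumes the three sequences have equal length and that at least one
    position is marked for evaluation.
    """
    total_error = sum(abs(p - t) * m for p, t, m in zip(predictions, targets, mask))
    n_evaluated = sum(mask)
    return total_error / n_evaluated


predictions = [0.4, 0.0, 1.0, 0.9]
targets = [0.5, 0.0, 1.2, 0.1]
mask = [1, 0, 1, 0]  # only the first and third positions are scored
score = masked_mae(predictions, targets, mask)
print("Toy masked MAE: %.4f" % score)
```

The large error at the last position is ignored because its mask entry is 0, so the toy score averages only the two evaluated errors (0.1 and 0.2).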


### 🚀 An example of **Transformer** for imputation

In [None]:
from pypots.optim import Adam
from pypots.imputation import Transformer
from pypots.utils.metrics import cal_mae

# initialize the model
transformer = Transformer(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    n_layers=6,
    d_model=512,
    d_inner=256,
    n_heads=4,
    d_k=128,
    d_v=128,
    dropout=0.1,
    attn_dropout=0,
    # You can adjust ORT_weight and MIT_weight to make the model focus more on one task.
    # Usually you can just leave them at their default value, i.e. 1.
    ORT_weight=1,
    MIT_weight=1,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validating loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/imputation/transformer",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validating set to select the best model for testing in the next step
transformer.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, impute the originally-missing values and artificially-missing values in the test set
transformer_imputation = transformer.impute(dataset_for_testing)

# calculate mean absolute error on the ground truth (artificially-missing values)
testing_mae = cal_mae(
    transformer_imputation,
    physionet2012_dataset['test_X_intact'],
    physionet2012_dataset['test_X_indicating_mask']
)
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 17:32:12 [INFO]: No given device, using default device: cuda
2023-05-21 17:32:12 [INFO]: Model files will be saved to tutorial_results/imputation/transformer/20230521_T173212
2023-05-21 17:32:12 [INFO]: Tensorboard file will be saved to tutorial_results/imputation/transformer/20230521_T173212/tensorboard
2023-05-21 17:32:19 [INFO]: Model initialized successfully with the number of trainable parameters: 7,938,597
2023-05-21 17:32:31 [INFO]: epoch 0: training loss 0.8080, validating loss 0.5966
2023-05-21 17:32:40 [INFO]: epoch 1: training loss 0.6327, validating loss 0.5566
2023-05-21 17:32:49 [INFO]: epoch 2: training loss 0.5875, validating loss 0.5367
2023-05-21 17:32:59 [INFO]: epoch 3: training loss 0.5696, validating loss 0.5303
2023-05-21 17:33:08 [INFO]: epoch 4: training loss 0.5617, validating loss 0.5346
2023-05-21 17:33:17 [INFO]: epoch 5: training loss 0.5543, validating loss 0.5240
2023-05-21 17:33:27 [INFO]: epoch 6: training loss 0.5510, validating loss 0.5108

Testing mean absolute error: 0.5056


### 🚀 An example of **BRITS** for imputation

In [None]:
from pypots.optim import Adam
from pypots.imputation import BRITS
from pypots.utils.metrics import cal_mae

# initialize the model
brits = BRITS(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    rnn_hidden_size=128,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validating loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/imputation/brits",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validating set to select the best model for testing in the next step
brits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, impute the originally-missing values and artificially-missing values in the test set
brits_imputation = brits.impute(dataset_for_testing)

# calculate mean absolute error on the ground truth (artificially-missing values)
testing_mae = cal_mae(
    brits_imputation,
    physionet2012_dataset['test_X_intact'],
    physionet2012_dataset['test_X_indicating_mask']
)
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 17:37:39 [INFO]: No given device, using default device: cuda
2023-05-21 17:37:39 [INFO]: Model files will be saved to tutorial_results/imputation/brits/20230521_T173739
2023-05-21 17:37:39 [INFO]: Tensorboard file will be saved to tutorial_results/imputation/brits/20230521_T173739/tensorboard
2023-05-21 17:37:39 [INFO]: Model initialized successfully with the number of trainable parameters: 239,344
2023-05-21 17:39:03 [INFO]: epoch 0: training loss 0.9475, validating loss 0.3534
2023-05-21 17:40:02 [INFO]: epoch 1: training loss 0.7369, validating loss 0.3107
2023-05-21 17:41:01 [INFO]: epoch 2: training loss 0.6845, validating loss 0.2903
2023-05-21 17:42:00 [INFO]: epoch 3: training loss 0.6596, validating loss 0.2800
2023-05-21 17:42:59 [INFO]: epoch 4: training loss 0.6443, validating loss 0.2738
2023-05-21 17:43:57 [INFO]: epoch 5: training loss 0.6329, validating loss 0.2691
2023-05-21 17:44:56 [INFO]: epoch 6: training loss 0.6238, validating loss 0.2660
2023-05-21 17

Testing mean absolute error: 0.2555


### 🚀 An example of **M-RNN** for imputation

In [None]:
from pypots.optim import Adam
from pypots.imputation import MRNN
from pypots.utils.metrics import cal_mae

# initialize the model
mrnn = MRNN(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    rnn_hidden_size=128,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validating loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/imputation/mrnn",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validating set to select the best model for testing in the next step
mrnn.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, impute the originally-missing values and artificially-missing values in the test set
mrnn_imputation = mrnn.impute(dataset_for_testing)

# calculate mean absolute error on the ground truth (artificially-missing values)
testing_mae = cal_mae(
    mrnn_imputation,
    physionet2012_dataset['test_X_intact'],
    physionet2012_dataset['test_X_indicating_mask']
)
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 17:48:06 [INFO]: No given device, using default device: cuda
2023-05-21 17:48:06 [INFO]: Model files will be saved to tutorial_results/imputation/mrnn/20230521_T174806
2023-05-21 17:48:06 [INFO]: Tensorboard file will be saved to tutorial_results/imputation/mrnn/20230521_T174806/tensorboard
2023-05-21 17:48:06 [INFO]: Model initialized successfully with the number of trainable parameters: 265,939
2023-05-21 17:48:54 [INFO]: epoch 0: training loss 1.0076, validating loss 0.6060
2023-05-21 17:49:18 [INFO]: epoch 1: training loss 0.4206, validating loss 0.6812
2023-05-21 17:49:43 [INFO]: epoch 2: training loss 0.3212, validating loss 0.7078
2023-05-21 17:50:07 [INFO]: epoch 3: training loss 0.2673, validating loss 0.7258
2023-05-21 17:50:07 [INFO]: Exceeded the training patience. Terminating the training procedure...
2023-05-21 17:50:07 [INFO]: Finished training.
2023-05-21 17:50:07 [INFO]: Saved the model to tutorial_results/imputation/mrnn/20230521_T174806/MRNN.pypots.


Testing mean absolute error: 0.7208
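
The M-RNN log above shows `patience=3` in action: the validating loss is lowest at epoch 0 and fails to improve for the next three epochs, so training stops after epoch 3. The sketch below replays that rule on the logged losses. It assumes patience counts consecutive epochs without a new best validating loss; `early_stop_epoch` is a hypothetical helper, not PyPOTS code.

```python
def early_stop_epoch(val_losses, patience):
    """Return the epoch at which training would stop, or None if it never does."""
    best = float("inf")
    remaining = patience
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # new best loss: remember it and reset the patience counter
            best = loss
            remaining = patience
        else:
            remaining -= 1
            if remaining == 0:
                return epoch
    return None


# validating losses copied from the M-RNN log above
mrnn_val_losses = [0.6060, 0.6812, 0.7078, 0.7258]
stop_epoch = early_stop_epoch(mrnn_val_losses, patience=3)
print("Training stops after epoch:", stop_epoch)
```

A run whose validating loss keeps decreasing never exhausts its patience and instead runs for the full number of `epochs`.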


### 🚀 An example of **LOCF** for imputation

In [None]:
from pypots.imputation import LOCF
from pypots.utils.metrics import cal_mae

# initialize the model
locf = LOCF(
    nan=0  # the value used to fill values missing at the beginning of a sequence, which LOCF cannot impute
)

# LOCF has no parameters to train, so impute() can be called directly; fit() is run here only to keep the workflow consistent
locf.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, impute the originally-missing values and artificially-missing values in the test set
locf_imputation = locf.impute(dataset_for_testing)

# calculate mean absolute error on the ground truth (artificially-missing values)
testing_mae = cal_mae(
    locf_imputation,
    physionet2012_dataset['test_X_intact'],
    physionet2012_dataset['test_X_indicating_mask']
)
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 17:50:35 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.


Testing mean absolute error: 0.4079
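
For intuition, last observation carried forward (LOCF) can be sketched on a single feature in a few lines of plain Python. `locf_1d` is a hypothetical helper for this note and its `nan` argument mirrors the role of `nan=0` above; the library version operates on whole arrays.

```python
import math


def locf_1d(series, nan=0.0):
    """Fill each missing value with the most recent observed value.

    Values missing before the first observation have nothing to carry
    forward, so they are filled with `nan` instead.
    """
    filled = []
    last_observed = None
    for value in series:
        if math.isnan(value):
            filled.append(nan if last_observed is None else last_observed)
        else:
            last_observed = value
            filled.append(value)
    return filled


toy_series = [math.nan, 1.0, math.nan, math.nan, 4.0, math.nan]
toy_filled = locf_1d(toy_series, nan=0.0)
print(toy_filled)
```

In a multivariate series the same rule would be applied to each feature independently along the time axis.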




## 🌟 Clustering Models

In [None]:
# Assemble the datasets for training, validating, and testing.
import numpy as np

# don't need validation set
dataset_for_training = {
    "X": np.concatenate([physionet2012_dataset['train_X'], physionet2012_dataset['val_X']], axis=0),
    "y": np.concatenate([physionet2012_dataset['train_y'], physionet2012_dataset['val_y']], axis=0),
}

dataset_for_testing = {
    "X": physionet2012_dataset['test_X'],
    "y": physionet2012_dataset['test_y'],
}


### 🚀 An example of **CRLI** for clustering

In [None]:
from pypots.optim import Adam
from pypots.clustering import CRLI
from pypots.utils.metrics import cal_rand_index, cal_cluster_purity

# initialize the model
crli = CRLI(
    n_steps=physionet2012_dataset["n_steps"],
    n_features=physionet2012_dataset["n_features"],
    n_clusters=physionet2012_dataset["n_classes"],
    n_generator_layers=2,
    rnn_hidden_size=256,
    rnn_cell_type="GRU",
    decoder_fcn_output_dims=[256, 128],  # the output dimensions of layers in the decoder FCN.
    # Here it means there are 3 layers. Leaving it at its default, None, results in
    # the FCN having only one layer.
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the monitored loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizers. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave them at their default, which initializes an Adam optimizer with lr=0.001.
    G_optimizer=Adam(lr=1e-3),
    D_optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/clustering/crli",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set (no validation set is used for clustering here)
crli.fit(train_set=dataset_for_training)

# the testing stage, cluster the samples in the test set
crli_prediction = crli.cluster(dataset_for_testing)

# calculate the Rand index (RI) and cluster purity (CP) against the ground-truth labels
RI = cal_rand_index(crli_prediction, dataset_for_testing["y"])
CP = cal_cluster_purity(crli_prediction, dataset_for_testing["y"])
print(
    "Testing clustering metrics: \n"
    f'RI: {RI}, \n'
    f'CP: {CP}\n'
)


2023-05-21 17:52:10 [INFO]: No given device, using default device: cuda
2023-05-21 17:52:10 [INFO]: Model files will be saved to tutorial_results/clustering/crli/20230521_T175210
2023-05-21 17:52:10 [INFO]: Tensorboard file will be saved to tutorial_results/clustering/crli/20230521_T175210/tensorboard
2023-05-21 17:52:10 [INFO]: Model initialized successfully with the number of trainable parameters: 1,546,820
2023-05-21 17:53:37 [INFO]: epoch 0: training loss_generator 3.3941, train loss_discriminator 0.3881
2023-05-21 17:55:01 [INFO]: epoch 1: training loss_generator 3.4165, train loss_discriminator 0.3679
2023-05-21 17:56:25 [INFO]: epoch 2: training loss_generator 3.4143, train loss_discriminator 0.3492
2023-05-21 17:57:48 [INFO]: epoch 3: training loss_generator 9.8183, train loss_discriminator 0.3325
2023-05-21 17:57:48 [INFO]: Exceeded the training patience. Terminating the training procedure...
2023-05-21 17:57:48 [INFO]: Finished training.
2023-05-21 17:57:48 [INFO]: Saved the 

Testing clustering metrics: 
RI: 0.4999754697542069, 
CP: 0.8586321934945789
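
The two clustering metrics measure different things, so it helps to see them side by side. The sketch below implements the textbook definitions of the Rand index and cluster purity in plain Python. Both functions are hypothetical illustrations, not the PyPOTS implementations, and the pairwise loop is only practical for small inputs.

```python
from collections import Counter
from itertools import combinations


def rand_index(pred, true):
    """Fraction of sample pairs on which the clustering and the labels agree."""
    pairs = list(combinations(range(len(true)), 2))
    # a pair agrees if it is grouped together in both, or separated in both
    agree = sum((pred[i] == pred[j]) == (true[i] == true[j]) for i, j in pairs)
    return agree / len(pairs)


def cluster_purity(pred, true):
    """Share of samples that belong to the majority class of their cluster."""
    clusters = {}
    for p, t in zip(pred, true):
        clusters.setdefault(p, Counter())[t] += 1
    return sum(max(counts.values()) for counts in clusters.values()) / len(true)


true_labels = [0, 0, 0, 1]
one_big_cluster = [0, 0, 0, 0]  # a trivial clustering that never separates samples
toy_ri = rand_index(one_big_cluster, true_labels)
toy_cp = cluster_purity(one_big_cluster, true_labels)
print(f"RI: {toy_ri}, CP: {toy_cp}")
```

On imbalanced labels, even this trivial single-cluster assignment reaches a purity equal to the majority-class share while its Rand index stays low, which is why the two numbers are best read together.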





### 🚀 An example of **VaDER** for clustering

In [None]:
from pypots.optim import Adam
from pypots.clustering import VaDER
from pypots.utils.metrics import cal_rand_index, cal_cluster_purity

# initialize the model
vader = VaDER(
    n_steps=physionet2012_dataset["n_steps"],
    n_features=physionet2012_dataset["n_features"],
    n_clusters=physionet2012_dataset["n_classes"],
    rnn_hidden_size=128,
    d_mu_stddev=2,
    pretrain_epochs=20,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the monitored loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/clustering/vader",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set (no validation set is used for clustering here)
vader.fit(train_set=dataset_for_training)

# the testing stage, cluster the samples in the test set
vader_prediction = vader.cluster(dataset_for_testing)

# calculate the Rand index (RI) and cluster purity (CP) against the ground-truth labels
RI = cal_rand_index(vader_prediction, dataset_for_testing["y"])
CP = cal_cluster_purity(vader_prediction, dataset_for_testing["y"])
print(
    "Testing clustering metrics: \n"
    f'RI: {RI}, \n'
    f'CP: {CP}\n'
)


2023-05-21 18:53:57 [INFO]: No given device, using default device: cuda
2023-05-21 18:53:57 [INFO]: Model files will be saved to tutorial_results/clustering/vader/20230521_T185357
2023-05-21 18:53:57 [INFO]: Tensorboard file will be saved to tutorial_results/clustering/vader/20230521_T185357/tensorboard
2023-05-21 18:53:57 [INFO]: Model initialized successfully with the number of trainable parameters: 293,644
2023-05-21 19:08:50 [INFO]: epoch 0: training loss 1.5763
2023-05-21 19:09:32 [INFO]: epoch 1: training loss 1.0839
2023-05-21 19:10:09 [INFO]: epoch 2: training loss 1.0469
2023-05-21 19:10:46 [INFO]: epoch 3: training loss 1.0383
2023-05-21 19:11:23 [INFO]: epoch 4: training loss 1.0328
2023-05-21 19:12:01 [INFO]: epoch 5: training loss 1.0359
2023-05-21 19:12:41 [INFO]: epoch 6: training loss 1.0371
2023-05-21 19:13:18 [INFO]: epoch 7: training loss 1.0402
2023-05-21 19:13:18 [INFO]: Exceeded the training patience. Terminating the training procedure...
2023-05-21 19:13:18 [INFO

Testing clustering metrics: 
RI: 0.7500013048003081, 
CP: 0.853628023352794



## 🌟 Forecasting Models

In [None]:
# Assemble the datasets for training, validating, and testing.

dataset_for_training = {
    "X": physionet2012_dataset['train_X'],
}

dataset_for_validating = {
    "X": physionet2012_dataset['val_X'],
    "X_intact": physionet2012_dataset['val_X_intact'],
    "indicating_mask": physionet2012_dataset['val_X_indicating_mask'],
}

dataset_for_testing = {
    "X": physionet2012_dataset['test_X'][:, :36],  # we only take the first 36 steps as model input,
    # and let the model forecast the remaining 12 steps
}


### 🚀 An example of **BTTF** for forecasting

In [None]:
import numpy as np  # needed below to build the evaluation target and mask
from pypots.forecasting import BTTF
from pypots.utils.metrics import cal_mae

# initialize the model
bttf = BTTF(
    36,
    physionet2012_dataset["n_features"],
    pred_step=12,
    rank=10,
    time_lags=[1, 2, 3, 10, 10 + 1, 10 + 2, 20, 20 + 1, 20 + 2],
    burn_iter=5,
    gibbs_iter=5,
    multi_step=1,
)

# fit the model on the training set
bttf.fit(train_set=dataset_for_training, val_set=dataset_for_validating)
# Note that BTTF does not actually need fit() to be run before forecasting.

# the testing stage, forecast the last 12 steps of the test set from its first 36 steps
bttf_forecasting_results = bttf.forecast(dataset_for_testing)

# calculate mean absolute error on the observed values in the forecasted steps
testing_mae = cal_mae(
    bttf_forecasting_results,
    np.nan_to_num(physionet2012_dataset['test_X'][:, 36:]),
    (~np.isnan(physionet2012_dataset['test_X'][:, 36:])).astype(int),
)
print("Testing mean absolute error: %.4f" % testing_mae)


2023-05-21 18:14:45 [INFO]: No given device, using default device: cuda
2023-05-21 18:14:45 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.


Testing mean absolute error: 1.2239


## 🌟 Classification Models

In [None]:
# Assemble the datasets for training, validating, and testing.

dataset_for_training = {
    "X": physionet2012_dataset['train_X'],
    "y": physionet2012_dataset['train_y'],
}

dataset_for_validating = {
    "X": physionet2012_dataset['val_X'],
    "y": physionet2012_dataset['val_y'],
}

dataset_for_testing = {
    "X": physionet2012_dataset['test_X'],
    "y": physionet2012_dataset['test_y'],
}

### 🚀 An example of **BRITS** for classification

In [None]:
from pypots.optim import Adam
from pypots.classification import BRITS
from pypots.utils.metrics import cal_binary_classification_metrics

# initialize the model
brits = BRITS(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    n_classes=physionet2012_dataset["n_classes"],
    rnn_hidden_size=256,
    batch_size=32,
    # here we set epochs=10 for a quick demo, you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validating loss doesn't decrease for 3 epochs.
    # You can leave it at its default, None, to disable early stopping.
    patience=3,
    # Set the optimizer. Unlike torch.optim.Optimizer, you don't have to specify the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at its default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # The num_workers argument is passed to torch.utils.data.DataLoader. It's the number of subprocesses to use for data loading.
    # Leaving it at its default, 0, means data is loaded in the main process, i.e. there are no subprocesses.
    # You can increase it if you think data loading is a bottleneck for your model's training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/classification/brits",
    # only save the best model after training finished.
    # You can also set it as "better" to save models performing better ever during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validating set to select the best model for testing in the next step
brits.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage, classify the samples in the test set
brits_prediction = brits.classify(dataset_for_testing)

# calculate binary classification metrics against the ground-truth labels
metrics = cal_binary_classification_metrics(brits_prediction, dataset_for_testing["y"])
print("Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'PR_AUC: {metrics["pr_auc"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

2023-05-21 18:42:05 [ERROR]: cannot import name 'ObservationPropagation' from 'pypots.classification.raindrop.modules' (/usr/local/lib/python3.10/dist-packages/pypots/classification/raindrop/modules.py)
torch_geometric is missing, please install it with 'pip install torch_geometric' or 'conda install -c pyg pyg'
2023-05-21 18:42:05 [INFO]: No given device, using default device: cuda
2023-05-21 18:42:05 [INFO]: Model files will be saved to tutorial_results/classification/brits/20230521_T184205
2023-05-21 18:42:05 [INFO]: Tensorboard file will be saved to tutorial_results/classification/brits/20230521_T184205/tensorboard
2023-05-21 18:42:08 [INFO]: Model initialized successfully with the number of trainable parameters: 730,612
2023-05-21 18:43:31 [INFO]: epoch 0: training loss 0.9122, validating loss 0.8039
2023-05-21 18:44:32 [INFO]: epoch 1: training loss 0.7784, validating loss 0.7490
2023-05-21 18:45:31 [INFO]: epoch 2: training loss 0.7265, validating loss 0.7239
2023-05-21 18:46:31

Testing classification metrics: 
ROC_AUC: 0.8344878266715101, 
PR_AUC: 0.46329618945884665,
F1: 0.31865828092243187,
Precision: 0.6031746031746031,
Recall: 0.21652421652421652,
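
As a quick sanity check on the output above: F1 is the harmonic mean of precision and recall, so it can be recomputed from the two printed values. The snippet below is only a sketch that copies the numbers from the BRITS run; it does not call PyPOTS.

```python
# Precision and recall copied from the BRITS test output above.
precision = 0.6031746031746031
recall = 0.21652421652421652

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(f"F1 recomputed from precision and recall: {f1:.6f}")
```

The low F1 despite a reasonable ROC_AUC comes from the low recall: the harmonic mean is pulled toward the smaller of the two values.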



### 🚀 An example of **GRUD** for classification

In [None]:
from pypots.optim import Adam
from pypots.classification import GRUD
from pypots.utils.metrics import cal_binary_classification_metrics

# initialize the model
grud = GRUD(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    n_classes=physionet2012_dataset["n_classes"],
    rnn_hidden_size=32,
    batch_size=32,
    # here we set epochs=10 for a quick demo; you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validation loss doesn't decrease for 3 epochs.
    # You can leave it at the default None to disable early stopping.
    patience=3,
    # specify the optimizer. Unlike torch.optim.Optimizer, you don't have to pass the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # num_workers is passed to torch.utils.data.DataLoader. It's the number of subprocesses used for data loading.
    # The default 0 means data is loaded in the main process, i.e. without subprocesses.
    # You can increase it if you think data loading is a bottleneck for your training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/classification/grud",
    # only save the best model after training finishes.
    # You can also set it to "better" to save every model that improves on the previous best during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validation set to select the best model for testing in the next step
grud.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage: classify the samples in the test set
grud_prediction = grud.classify(dataset_for_testing)

# calculate binary classification metrics against the ground-truth labels
metrics = cal_binary_classification_metrics(grud_prediction, dataset_for_testing["y"])
print("Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'PR_AUC: {metrics["pr_auc"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

2023-05-22 12:15:28 [ERROR]: No module named 'torch_geometric'
torch_geometric is missing, please install it with 'pip install torch_geometric' or 'conda install -c pyg pyg'
2023-05-22 12:15:28 [ERROR]: name 'MessagePassing' is not defined
Note torch_geometric is missing, please install it with 'pip install torch_geometric' or 'conda install -c pyg pyg'
2023-05-22 12:15:28 [INFO]: No given device, using default device: cuda
2023-05-22 12:15:28 [INFO]: Model files will be saved to tutorial_results/classification/grud/20230522_T121528
2023-05-22 12:15:28 [INFO]: Tensorboard file will be saved to tutorial_results/classification/grud/20230522_T121528/tensorboard
2023-05-22 12:15:35 [INFO]: Model initialized successfully with the number of trainable parameters: 16,128
2023-05-22 12:15:35 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.
2023-05-22 12:15:46 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.
2023-05-22 12:16:06 

Testing classification metrics: 
ROC_AUC: 0.8124554156025495, 
PR_AUC: 0.4419410911679675,
F1: 0.39626168224299063,
Precision: 0.53,
Recall: 0.3164179104477612,
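
The `patience` argument used in the cell above controls early stopping. The sketch below illustrates the general idea with a patience counter; it is not PyPOTS's actual implementation, and both the helper `epoch_of_early_stop` and the loss values are hypothetical, made up for illustration.

```python
def epoch_of_early_stop(val_losses, patience):
    """Return the epoch index at which training would stop early, or None if it never does."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # a new best validation loss resets the counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None


# hypothetical validation-loss curve, for illustration only
val_losses = [0.80, 0.75, 0.72, 0.73, 0.74, 0.76, 0.71]
print(epoch_of_early_stop(val_losses, patience=3))  # → 5
```

With `patience=3`, training stops at epoch 5 because epochs 3, 4 and 5 all fail to beat the best loss from epoch 2; the later improvement at epoch 6 is never reached.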



### 🚀 An example of **Raindrop** for classification

In [None]:
import torch

print(torch.__version__)

2.0.1+cu118


In [None]:
# install necessary dependencies for Raindrop
! pip install torch-geometric torch-scatter torch-sparse -f "https://data.pyg.org/whl/torch-2.0.0+cu118.html"


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in links: https://data.pyg.org/whl/torch-2.0.0+cu118.html
Collecting torch-geometric
  Using cached torch_geometric-2.3.1-py3-none-any.whl
Collecting torch-scatter
  Downloading https://data.pyg.org/whl/torch-2.0.0%2Bcu118/torch_scatter-2.1.1%2Bpt20cu118-cp310-cp310-linux_x86_64.whl (10.2 MB)
Collecting torch-sparse
  Downloading https://data.pyg.org/whl/torch-2.0.0%2Bcu118/torch_sparse-0.6.17%2Bpt20cu118-cp310-cp310-linux_x86_64.whl (4.8 MB)
Installing collected packages: torch-scatter, torch-sparse, torch-geometric
Successfully installed torch-geometric-2.3.1 torch-scatter-2.1.1+pt20cu118 torch-sparse-0.6.17+pt20cu118
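
The `-f` URL in the install command above has to match the installed PyTorch build, which is why `torch.__version__` is printed first. Here is a small sketch for deriving that URL from the version string. The helper `pyg_wheel_index` is hypothetical. It also assumes that the wheel index pages are keyed on the `major.minor.0` release (as in the command above, where torch 2.0.1 uses the `torch-2.0.0` page) and that a version string without a `+build` suffix should map to `cpu`.

```python
def pyg_wheel_index(torch_version: str) -> str:
    """Build the wheel index URL for a torch version string such as '2.0.1+cu118'."""
    base, _, build = torch_version.partition("+")
    major, minor = base.split(".")[:2]
    # assumption: no '+build' suffix means a CPU-only build
    build = build or "cpu"
    # assumption: index pages are published per minor release, with patch version 0
    return f"https://data.pyg.org/whl/torch-{major}.{minor}.0+{build}.html"


print(pyg_wheel_index("2.0.1+cu118"))
# → https://data.pyg.org/whl/torch-2.0.0+cu118.html
```

In a notebook you could pass `torch.__version__` to this helper and use the result after `-f` in the `pip install` command.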


In [None]:
from pypots.optim import Adam
from pypots.classification import Raindrop
from pypots.utils.metrics import cal_binary_classification_metrics

# initialize the model
raindrop = Raindrop(
    n_steps=physionet2012_dataset['n_steps'],
    n_features=physionet2012_dataset['n_features'],
    n_classes=physionet2012_dataset["n_classes"],
    n_layers=2,
    d_model=physionet2012_dataset["n_features"] * 4,
    d_inner=256,
    n_heads=2,
    dropout=0.3,
    batch_size=32,
    # here we set epochs=10 for a quick demo; you can set it to 100 or more for better performance
    epochs=10,
    # here we set patience=3 to stop training early if the validation loss doesn't decrease for 3 epochs.
    # You can leave it at the default None to disable early stopping.
    patience=3,
    # specify the optimizer. Unlike torch.optim.Optimizer, you don't have to pass the model's parameters when
    # initializing pypots.optim.Optimizer. You can also leave it at the default, which initializes an Adam optimizer with lr=0.001.
    optimizer=Adam(lr=1e-3),
    # num_workers is passed to torch.utils.data.DataLoader. It's the number of subprocesses used for data loading.
    # The default 0 means data is loaded in the main process, i.e. without subprocesses.
    # You can increase it if you think data loading is a bottleneck for your training speed.
    num_workers=0,
    # Set it to None to use the default device (will use CPU if you don't have CUDA devices).
    # You can also set it to 'cpu' or 'cuda' explicitly, or ['cuda:0', 'cuda:1'] if you have multiple CUDA devices.
    device=None,
    # set the path for saving tensorboard and trained model files
    saving_path="tutorial_results/classification/raindrop",
    # only save the best model after training finishes.
    # You can also set it to "better" to save every model that improves on the previous best during training.
    model_saving_strategy="best",
)

# train the model on the training set, and validate it on the validation set to select the best model for testing in the next step
raindrop.fit(train_set=dataset_for_training, val_set=dataset_for_validating)

# the testing stage: classify the samples in the test set
raindrop_prediction = raindrop.classify(dataset_for_testing)

# calculate binary classification metrics against the ground-truth labels
metrics = cal_binary_classification_metrics(raindrop_prediction, dataset_for_testing["y"])
print("Testing classification metrics: \n"
    f'ROC_AUC: {metrics["roc_auc"]}, \n'
    f'PR_AUC: {metrics["pr_auc"]},\n'
    f'F1: {metrics["f1"]},\n'
    f'Precision: {metrics["precision"]},\n'
    f'Recall: {metrics["recall"]},\n'
)

2023-05-22 12:25:24 [INFO]: No given device, using default device: cuda
2023-05-22 12:25:24 [INFO]: Model files will be saved to tutorial_results/classification/raindrop/20230522_T122524
2023-05-22 12:25:24 [INFO]: Tensorboard file will be saved to tutorial_results/classification/raindrop/20230522_T122524/tensorboard
2023-05-22 12:25:26 [INFO]: Model initialized successfully with the number of trainable parameters: 1,415,006
2023-05-22 12:25:26 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.
2023-05-22 12:25:37 [INFO]: saving_path not given. Model files and tensorboard file will not be saved.
2023-05-22 12:26:05 [INFO]: epoch 0: training loss 0.3798, validating loss 0.3416
2023-05-22 12:26:30 [INFO]: epoch 1: training loss 0.3387, validating loss 0.3300
2023-05-22 12:26:55 [INFO]: epoch 2: training loss 0.3194, validating loss 0.3364
2023-05-22 12:27:21 [INFO]: epoch 3: training loss 0.3095, validating loss 0.3104
2023-05-22 12:27:47 [INFO]: epoch 4: 

Testing classification metrics: 
ROC_AUC: 0.8478756743219553, 
PR_AUC: 0.5212918602764244,
F1: 0.2418604651162791,
Precision: 0.7761194029850746,
Recall: 0.14325068870523416,
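
To read the three classification runs side by side, the sketch below collects the metrics printed above (rounded to 4 decimals) and reports which model scored highest on each. It only rearranges numbers already shown in this notebook. These are single short runs logged in different sessions, so small differences should be treated with caution.

```python
# Metrics copied from the three test outputs above, rounded to 4 decimals.
reported = {
    "BRITS": {"roc_auc": 0.8345, "pr_auc": 0.4633, "f1": 0.3187, "precision": 0.6032, "recall": 0.2165},
    "GRUD": {"roc_auc": 0.8125, "pr_auc": 0.4419, "f1": 0.3963, "precision": 0.5300, "recall": 0.3164},
    "Raindrop": {"roc_auc": 0.8479, "pr_auc": 0.5213, "f1": 0.2419, "precision": 0.7761, "recall": 0.1433},
}
metric_names = ["roc_auc", "pr_auc", "f1", "precision", "recall"]

# For each metric, find the model with the highest reported value.
best_by_metric = {
    metric: max(reported, key=lambda name: reported[name][metric])
    for metric in metric_names
}

# Print a small comparison table followed by the per-metric winners.
print(f"{'model':<10}" + "".join(f"{m:>11}" for m in metric_names))
for name, scores in reported.items():
    print(f"{name:<10}" + "".join(f"{scores[m]:>11.4f}" for m in metric_names))
for metric, name in best_by_metric.items():
    print(f"best {metric}: {name}")
```

In these runs Raindrop leads on ROC_AUC, PR_AUC and precision, while GRUD has the highest recall and F1.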

