# Latency and memory

## How to run this file on the lisa cluster

Start by pulling the latest version of the repository. If necessary, first re-add the SSH key to the agent ([see here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux)), i.e.

```sh
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
```

Then load the required modules (`2022` and `Anaconda3/2022.05`) and activate the `gvp` environment. Finally, run the notebook with

```sh
jupyter nbconvert "latency and memory.ipynb" --to notebook --execute --inplace --allow-errors
```

Make sure the environment contains the following packages: `ipywidgets`, `jupyter`, `notebook`, `torch`, `gvp`.
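
Before launching the notebook, it can help to confirm that the packages listed above are importable from the active environment. The following is a minimal sketch using only the standard library. The helper name `missing_packages` is hypothetical and not part of the repository, and the sketch assumes each package's import name matches the name listed above.

```python
import importlib.util

# Package names taken from the list above; import names are assumed to match.
REQUIRED = ["ipywidgets", "jupyter", "notebook", "torch", "gvp"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages are importable.")
```

Running this with the Python interpreter of the activated environment reports any package that still needs to be installed.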

## Setup

* Import everything needed to run GVP and SMLP.
* Set up the task to analyse.
* ...

### Imports

In [1]:
# Suppresses the following warning:
#   UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class.
#   This should only matter to you if you are using storages directly.
#   To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [2]:
import time
import torch
import torch.nn as nn

from run_atom3d import get_datasets, get_model
from torch.profiler import profile, record_function, ProfilerActivity

### Task setup

In [3]:
DATA_DIR = "../data/"

device = "cuda" if torch.cuda.is_available() else "cpu"
if device != "cuda":
    # CUDA profiling and memory statistics need a GPU, so stop here without one
    exit()

TASK = "LBA"
"""Task to test for model latency."""
LBA_SPLIT = 30
# LBA_SPLIT = 60

DATASET_SPLIT = "train"
"""Dataset partition from which the sample is chosen."""
# Convert the partition name to the index used to select from `datasets` below
DATASET_SPLIT = ["train", "val", "test"].index(DATASET_SPLIT)
SAMPLE_INDEX = 10
"""Dataset index to use for testing the latency and memory."""

BEST_GVP_MODEL = "./best_models/LBA_lba-split=30_47.pt"
"""Path to the best trained relevant model."""


In [4]:
datasets = get_datasets(TASK, DATA_DIR, LBA_SPLIT)

gvp_model = get_model(TASK).to(device)
# gvp_model = nn.DataParallel(gvp_model)  # Add parallel to fix OOM issues?
gvp_model.load_state_dict(torch.load(BEST_GVP_MODEL), strict=False)

dataset_sample = datasets[DATASET_SPLIT][SAMPLE_INDEX].to(device)
# A batch vector is required for the scatter mean over the graph's nodes;
# with a single graph, every node belongs to graph 0
dataset_sample.batch = torch.zeros_like(dataset_sample.atoms)

print(dataset_sample)

Data(x=[551, 3], edge_index=[2, 13194], atoms=[551], edge_s=[13194, 16], edge_v=[13194, 1, 3], label=6.85, lig_flag=[551], batch=[551])
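
The `batch` vector assigns every node to a graph, and a scatter mean then averages node values per graph index. Since a single graph is profiled here, all 551 nodes get index 0. The snippet below is a plain-Python illustration of that pooling step, not the tensor operation the model actually calls; `scatter_mean` is a hypothetical stand-in written for this example.

```python
def scatter_mean(values, index, num_groups):
    """Average `values` per group, where index[i] is the group of values[i]."""
    sums = [0.0] * num_groups
    counts = [0] * num_groups
    for value, group in zip(values, index):
        sums[group] += value
        counts[group] += 1
    # Empty groups are reported as 0.0 to avoid dividing by zero
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


node_values = [1.0, 2.0, 3.0, 6.0]

# One graph: an all-zero batch vector pools every node into a single mean
print(scatter_mean(node_values, [0, 0, 0, 0], num_groups=1))  # [3.0]

# Two graphs in one batch: nodes 0-1 belong to graph 0, nodes 2-3 to graph 1
print(scatter_mean(node_values, [0, 0, 1, 1], num_groups=2))  # [1.5, 4.5]
```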


In [5]:
# Warm-up run, so that one-off CUDA and profiler start-up overhead does not distort later measurements
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    gvp_model(dataset_sample)

STAGE:2023-05-24 14:51:20 28578:28578 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:321] Completed Stage: Post Processing


## Latency

### GVP

In [6]:
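def test_latency(model, dataset_sample):
    # CUDA kernels run asynchronously, so synchronise before reading the clock
    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model(dataset_sample)
    torch.cuda.synchronize()
    end = time.perf_counter()
    return end - start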
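
A single wall-clock measurement tends to be noisy, so one option is to repeat the forward pass and report a median. The following is a minimal sketch under that assumption, using only the standard library. `measure_latency` and its parameters are hypothetical names, and `sync` is meant to receive something like `torch.cuda.synchronize` when timing GPU work.

```python
import statistics
import time


def measure_latency(fn, sync=None, warmup=3, repeats=20):
    """Time `fn()` several times and return (median, stdev) in seconds.

    `sync` is an optional zero-argument callable that blocks until queued
    device work has finished, e.g. `torch.cuda.synchronize` on a GPU.
    """
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(repeats):
        if sync is not None:
            sync()  # make sure earlier work does not leak into this timing
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()  # wait for the device before stopping the clock
        timings.append(time.perf_counter() - start)
    spread = statistics.stdev(timings) if len(timings) > 1 else 0.0
    return statistics.median(timings), spread
```

With the objects defined in this notebook, a call might look like `measure_latency(lambda: gvp_model(dataset_sample), sync=torch.cuda.synchronize)`.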

In [7]:
gvp_model.train()
# withgrad_latency = test_latency(gvp_model, dataset_sample)
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof_train:
    out = gvp_model(dataset_sample)


gvp_model.eval()
with torch.no_grad():
    # gradless_latency = test_latency(gvp_model, dataset_sample)
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
                 profile_memory=True) as prof_test:
        out = gvp_model(dataset_sample)

# print(f"{withgrad_latency=}")
# print(f"{gradless_latency=}")

STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-24 14:51:22 28578:28578 ActivityProfilerController.cpp:321] Completed Stage: Post Processing


In [8]:
print(prof_train.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_train.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10, top_level_events_only=True))
print()
print(prof_test.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_test.key_averages().table(sort_by="self_cuda_memory_usage", row_limit=10, top_level_events_only=True))

This report only display top-level ops statistics
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           aten::linear         2.51%     545.000us        32.43%       7.040ms      64.587us       0.000us         0.00%       6.420ms      58.899us     

In [9]:
raise NotImplementedError("Here comes the latency of the steerable implementation.")

NotImplementedError: Here comes the latency of the steerable implementation.

## Memory

In [10]:
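def test_memory(model, dataset_sample):
    # Difference in currently allocated CUDA memory (bytes) across the forward pass.
    # This counts memory still held afterwards (e.g. saved activations), not the peak.
    start = torch.cuda.memory_allocated()
    out = model(dataset_sample)
    end = torch.cuda.memory_allocated()
    return end - start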
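
`torch.cuda.memory_allocated()` only reports what is allocated at the moment it is called, so temporary buffers that are freed during the forward pass are not counted. A possible complement is to track the peak with `torch.cuda.reset_peak_memory_stats()` and `torch.cuda.max_memory_allocated()`. The sketch below assumes that approach; `peak_memory_bytes` is a hypothetical helper, and its `cuda` parameter exists only so the function can be exercised without a GPU.

```python
def peak_memory_bytes(fn, cuda=None):
    """Return the peak CUDA memory (bytes) allocated above the baseline while `fn()` runs.

    `cuda` defaults to the `torch.cuda` module; it is a parameter only so the
    helper can be exercised without a GPU.
    """
    if cuda is None:
        import torch  # imported lazily so this snippet loads without PyTorch
        cuda = torch.cuda
    cuda.synchronize()
    cuda.reset_peak_memory_stats()  # peak counter restarts at the current usage
    baseline = cuda.memory_allocated()
    fn()
    cuda.synchronize()
    return cuda.max_memory_allocated() - baseline
```

With the objects defined in this notebook, a call might look like `peak_memory_bytes(lambda: gvp_model(dataset_sample))`.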

In [11]:
gvp_model.train()
withgrad_memory = test_memory(gvp_model, dataset_sample)

gvp_model.eval()
with torch.no_grad():
    gradless_memory = test_memory(gvp_model, dataset_sample)

print(f"{withgrad_memory=}")
print(f"{gradless_memory=}")

withgrad_memory=487282176
gradless_memory=512
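
The two numbers printed above are raw byte counts from `torch.cuda.memory_allocated()`. Converted to binary units, the forward pass with gradients retains roughly 464.7 MiB, while the gradient-free pass retains only 512 bytes, most likely just the output tensor. The conversion can be sketched as follows; `format_bytes` is a hypothetical helper, not part of the repository.

```python
def format_bytes(num_bytes):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit that keeps the value below 1024, or at the largest unit
        if abs(value) < 1024 or unit == units[-1]:
            return f"{int(value)} B" if unit == "B" else f"{value:.1f} {unit}"
        value /= 1024


print(format_bytes(487282176))  # 464.7 MiB
print(format_bytes(512))        # 512 B
```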


In [12]:
raise NotImplementedError("Here comes the memory usage of the steerable implementation.")

NotImplementedError: Here comes the memory usage of the steerable implementation.

## Conclusion?