# Latency and memory

## How to run this file on the lisa cluster

Start by pulling the latest version of the repository. If necessary, first re-add the SSH key to the agent ([see here](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent?platform=linux)), i.e.

```sh
eval "$(ssh-agent -s)"
ssh-add ~/.ssh/id_ed25519
```

Then load the required modules (`2022` and `Anaconda3/2022.05`) and activate the `gvp` environment. Finally, run the notebook with

```sh
jupyter nbconvert "latency and memory.ipynb" --to notebook --execute --inplace --allow-errors
```

Make sure the environment contains at least the following packages: `ipywidgets`, `jupyter`, `notebook`, `torch`, `gvp`. The imports below additionally use `numpy` and `e3nn`.
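
A quick way to confirm that the environment is complete is to check that each package can be found before starting the run. The sketch below uses only the standard library. The helper name `missing_packages` is hypothetical, and the check assumes that each package is importable under the name listed above.

```python
import importlib.util

# Import names taken from the package list above (assumed to match the package names).
REQUIRED = ["ipywidgets", "jupyter", "notebook", "torch", "gvp"]


def missing_packages(names):
    """Return the names for which no importable module can be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
```

Running this inside the activated environment before `jupyter nbconvert` gives an early hint about a missing dependency, instead of an import error halfway through the notebook.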

## Setup

* Import everything needed to run GVP and SMLP,
* Set up the task we analyse,
* ...

### Imports

In [4]:
# Suppresses the following warning:
#   UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class.
#   This should only matter to you if you are using storages directly.
#   To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [5]:
import time, torch, os, sys
import numpy as np

from e3nn.o3 import Irreps
from torch.profiler import profile, ProfilerActivity

# for fixing relative import issues
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

from src.run_atom3d import get_datasets, get_model
from src.steerable_mlp import ConvModel

### Task setup
Adjust the data and model paths below so that they point to the correct folders.

In [6]:
DATA_DIR = "../atom3d-data/"

device = 'cuda' if torch.cuda.is_available() else 'cpu'
# The measurements below rely on CUDA memory statistics, so stop if no GPU is available.
if device != "cuda":
    exit()

TASK = "LBA"
"""Task to test for model latency."""
LBA_SPLIT = 30
# SPLIT = 60

DATASET_SPLIT = "train"
"""Dataset partition from which the sample is chosen."""
# Convert the partition name ("train", "val" or "test") to its index in `datasets`
DATASET_SPLIT = ["train", "val", "test"].index(DATASET_SPLIT)
SAMPLE_INDEX = 10
"""Dataset index to use for testing the latency and memory."""

BEST_GVP_MODEL = "../src/reproduced_models/LBA_lba-split=30_47.pt"
"""Path to the best trained GVP model for this task."""
BEST_SMLP_MODEL = "../src/sMLPmodels/LBA-lba_split=30-epoch=10.ckpt"
"""Path to the best trained SMLP model for this task."""
USE_DENSE = False
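
Since the paths above depend on the local checkout, it can help to verify that they exist before loading anything. The following is a minimal sketch using only the standard library; `missing_paths` is a hypothetical helper and not part of the repository.

```python
from pathlib import Path


def missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Paths as configured in the cell above; adjust them to your checkout.
    configured = [
        "../atom3d-data/",
        "../src/reproduced_models/LBA_lba-split=30_47.pt",
        "../src/sMLPmodels/LBA-lba_split=30-epoch=10.ckpt",
    ]
    for path in missing_paths(configured):
        print(f"Not found: {path}")
```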

In [7]:
datasets = get_datasets(TASK, DATA_DIR, LBA_SPLIT)

gvp_model = get_model(TASK).to(device)
# gvp_model = nn.DataParallel(gvp_model)  # Add parallel to fix OOM issues?
gvp_model.load_state_dict(torch.load(BEST_GVP_MODEL), strict=False)

dataset_sample = datasets[DATASET_SPLIT][SAMPLE_INDEX].to(device)
# Required for taking the scatter mean over the graph: assign every atom to batch 0
dataset_sample.batch = torch.zeros_like(dataset_sample.atoms)
# The steerable model expects the attributes `z` and `pos`.
dataset_sample.z = dataset_sample.atoms
dataset_sample.pos = dataset_sample.x

print(dataset_sample)

Data(x=[551, 3], edge_index=[2, 13194], atoms=[551], edge_s=[13194, 16], edge_v=[13194, 1, 3], label=6.85, lig_flag=[551], batch=[551], z=[551], pos=[551, 3])


In [8]:
def balanced_irreps(hidden_features: int, lmax: int) -> Irreps:
    """Divide subspaces equally over the feature budget"""
    N = int(hidden_features / (lmax + 1))
    irreps = []
    for l, irrep in enumerate(Irreps.spherical_harmonics(lmax)):
        n = int(N / (2 * l + 1))
        irreps.append(str(n) + "x" + str(irrep[1]))
    irreps = "+".join(irreps)
    irreps = Irreps(irreps)
    gap = hidden_features - irreps.dim
    if gap > 0:
        irreps = Irreps("{}x0e".format(gap)) + irreps
        irreps = irreps.simplify()
    return irreps

irreps_in = Irreps("32x0e")
irreps_hidden = balanced_irreps(128, 1)
irreps_edge = Irreps.spherical_harmonics(1)
irreps_out = Irreps("16x0e") if USE_DENSE else Irreps("1x0e")
smlp_model = ConvModel(
    irreps_in=irreps_in,
    irreps_hidden=irreps_hidden,
    irreps_edge=irreps_edge,
    irreps_out=irreps_out,
    depth=3,
    dense=USE_DENSE,
).to(device)
smlp_model.load_state_dict(torch.load(BEST_SMLP_MODEL), strict=False)

_IncompatibleKeys(missing_keys=['embedder.weight', 'layers.0.conv.tp.weight', 'layers.0.conv.tp.output_mask', 'layers.0.conv.radial_net.net.0.freq', 'layers.0.conv.radial_net.net.1.weight', 'layers.0.conv.radial_net.net.1.bias', 'layers.0.conv.radial_net.net.3.weight', 'layers.0.conv.radial_net.net.3.bias', 'layers.0.gate.mul.weight', 'layers.0.gate.mul.output_mask', 'layers.1.conv.tp.weight', 'layers.1.conv.tp.output_mask', 'layers.1.conv.radial_net.net.0.freq', 'layers.1.conv.radial_net.net.1.weight', 'layers.1.conv.radial_net.net.1.bias', 'layers.1.conv.radial_net.net.3.weight', 'layers.1.conv.radial_net.net.3.bias', 'layers.1.gate.mul.weight', 'layers.1.gate.mul.output_mask', 'layers.2.conv.tp.weight', 'layers.2.conv.tp.output_mask', 'layers.2.conv.radial_net.net.0.freq', 'layers.2.conv.radial_net.net.1.weight', 'layers.2.conv.radial_net.net.1.bias', 'layers.2.conv.radial_net.net.3.weight', 'layers.2.conv.radial_net.net.3.bias'], unexpected_keys=['epoch', 'global_step', 'pytorch-li
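
The output above lists the model parameters as missing and reports top-level entries such as `epoch` and `global_step` as unexpected. This pattern suggests that the file holds a full training checkpoint with the weights nested one level down, in which case `strict=False` would leave the model at its initial weights without raising an error. The sketch below shows one way to unwrap such a checkpoint. The `state_dict` key and the `model.` prefix are assumptions about how the checkpoint was saved, and `extract_state_dict` is a hypothetical helper.

```python
def extract_state_dict(checkpoint, prefix="model."):
    """Return the weights nested in a training checkpoint, with `prefix` stripped.

    Assumes the weights live under the "state_dict" key; if that key is absent,
    the checkpoint is treated as a plain state dict.
    """
    state_dict = checkpoint.get("state_dict", checkpoint)
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


if __name__ == "__main__":
    # Made-up stand-in for torch.load(BEST_SMLP_MODEL); real values would be tensors.
    checkpoint = {
        "epoch": 10,
        "global_step": 1234,
        "state_dict": {"model.embedder.weight": [0.0]},
    }
    print(list(extract_state_dict(checkpoint)))
```

With such a helper the load could read `smlp_model.load_state_dict(extract_state_dict(torch.load(BEST_SMLP_MODEL)))`. Keeping the default `strict=True` makes any remaining key mismatch fail loudly. For the latency and memory measurements the weight values should matter little, because the architecture stays the same.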

In [9]:
# Warm up with a few runs so that one-off CUDA start-up overhead does not skew the first profiled run
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof:
    for _ in range(30):
        gvp_model(dataset_sample)
        smlp_model(dataset_sample)

STAGE:2023-05-26 22:53:55 36294:36294 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-26 22:53:58 36294:36294 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-26 22:53:58 36294:36294 ActivityProfilerController.cpp:321] Completed Stage: Post Processing


In [10]:
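def test_latency_memory(model, dataset_sample):
    """Time 100 forward passes; return (latencies in seconds, change in allocated CUDA memory in bytes)."""
    memory_start = torch.cuda.memory_allocated()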
    latencies = []
    for _ in range(100):
        latency_start = time.perf_counter()
        out = model(dataset_sample)
        latency_end = time.perf_counter()
        latencies.append(latency_end - latency_start)
    memory_end = torch.cuda.memory_allocated()
    return latencies, memory_end - memory_start
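
CUDA kernels are launched asynchronously, so wall-clock timing around a forward call can stop before the GPU has finished, unless something forces a synchronisation. A variant that synchronises explicitly before reading the clock could look like the sketch below. The helper `timed_runs` and its `synchronize` parameter are hypothetical; on a GPU one would pass `torch.cuda.synchronize`.

```python
import time


def timed_runs(fn, repeats=100, synchronize=lambda: None):
    """Call `fn` `repeats` times and return each call's wall-clock duration in seconds.

    `synchronize` is invoked before the clock is started and again before it is
    stopped, so that pending asynchronous work is included in the measurement.
    """
    latencies = []
    for _ in range(repeats):
        synchronize()
        start = time.perf_counter()
        fn()
        synchronize()
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    # CPU-only stand-in workload; with a model this could be
    # timed_runs(lambda: model(dataset_sample), synchronize=torch.cuda.synchronize)
    latencies = timed_runs(lambda: sum(range(10_000)), repeats=5)
    print(f"{len(latencies)} runs, fastest {min(latencies):.6f} s")
```

If the forward pass already ends in an operation that waits for the GPU, the numbers may not change much. Similarly, `torch.cuda.max_memory_allocated()` reports the peak allocation rather than the amount still allocated after the loop.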

## GVP

In [11]:
gvp_model.train()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof_train:
    withgrad_latency, withgrad_memory = test_latency_memory(gvp_model, dataset_sample)
    # out = gvp_model(dataset_sample)


gvp_model.eval()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            profile_memory=True) as prof_test:
    with torch.no_grad():
        gradless_latency, gradless_memory = test_latency_memory(gvp_model, dataset_sample)
        # out = gvp_model(dataset_sample)

print(f"{np.mean(withgrad_latency)=}")
print(f"{np.std(withgrad_latency)=}")
print(f"{np.mean(gradless_latency)=}")
print(f"{np.std(gradless_latency)=}")
print()
print(f"{withgrad_memory=}")
print(f"{gradless_memory=}")

STAGE:2023-05-26 22:54:04 36294:36294 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-26 22:54:07 36294:36294 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-26 22:54:07 36294:36294 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
STAGE:2023-05-26 22:54:22 36294:36294 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-26 22:54:25 36294:36294 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-26 22:54:25 36294:36294 ActivityProfilerController.cpp:321] Completed Stage: Post Processing


np.mean(withgrad_latency)=0.028397920759980478
np.std(withgrad_latency)=0.001347460038363938
np.mean(gradless_latency)=0.02743657376993724
np.std(gradless_latency)=0.0014193098946963653

withgrad_memory=485300736
gradless_memory=512


In [12]:
print(f"{withgrad_latency=}")
print(f"{gradless_latency=}")

withgrad_latency=[0.02941793699937989, 0.03024948799975391, 0.03107708800052933, 0.03152759400109062, 0.030469551000351203, 0.027549861999432324, 0.02838385399991239, 0.03034137900067435, 0.02988938400085317, 0.027877545000592363, 0.028723559000354726, 0.030156337001244538, 0.03805791800004954, 0.029140702001313912, 0.029382736998741166, 0.02696531500077981, 0.026971045999744092, 0.028546706000270206, 0.02732309000020905, 0.027732255999580957, 0.028429602998585324, 0.027029966999180033, 0.02693881500090356, 0.028392784000971005, 0.02826142300000356, 0.02702202600085002, 0.029093912000462296, 0.026965175000441377, 0.02702754699930665, 0.02818061099969782, 0.02777253599924734, 0.0280218279986002, 0.028134771000623005, 0.02718443699995987, 0.027049117001297418, 0.02834183400045731, 0.027667943999404088, 0.02803929000037897, 0.028633095998884528, 0.026852433999010827, 0.027679074999468867, 0.028258971999093774, 0.02864645699992252, 0.028158130999145214, 0.029006161001234432, 0.027489562999

In [13]:
print(prof_train.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_train.key_averages().table(sort_by="cuda_memory_usage", row_limit=10, top_level_events_only=True))

This report only display top-level ops statistics
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           aten::linear         1.98%      46.382ms        20.51%     479.884ms      44.026us       0.000us         0.00%        2.007s     184.088us     

In [14]:
print(prof_test.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_test.key_averages().table(sort_by="cuda_memory_usage", row_limit=10, top_level_events_only=True))

This report only display top-level ops statistics
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                           aten::linear         1.51%      35.219ms        17.53%     409.280ms      37.549us       0.000us         0.00%        1.917s     175.903us     

## Steerable

In [15]:
smlp_model.train()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             profile_memory=True) as prof_train:
    withgrad_latency, withgrad_memory = test_latency_memory(smlp_model, dataset_sample)
    # out = smlp_model(dataset_sample)


smlp_model.eval()
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
            profile_memory=True) as prof_test:
    with torch.no_grad():
        gradless_latency, gradless_memory = test_latency_memory(smlp_model, dataset_sample)
        # out = smlp_model(dataset_sample)

print(f"{np.mean(withgrad_latency)=}")
print(f"{np.std(withgrad_latency)=}")
print(f"{np.mean(gradless_latency)=}")
print(f"{np.std(gradless_latency)=}")
print()
print(f"{withgrad_memory=}")
print(f"{gradless_memory=}")

STAGE:2023-05-26 22:55:02 36294:36294 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-26 22:55:03 36294:36294 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-26 22:55:03 36294:36294 ActivityProfilerController.cpp:321] Completed Stage: Post Processing
STAGE:2023-05-26 22:55:07 36294:36294 ActivityProfilerController.cpp:311] Completed Stage: Warm Up
STAGE:2023-05-26 22:55:09 36294:36294 ActivityProfilerController.cpp:317] Completed Stage: Collection
STAGE:2023-05-26 22:55:09 36294:36294 ActivityProfilerController.cpp:321] Completed Stage: Post Processing


np.mean(withgrad_latency)=0.010318623949933681
np.std(withgrad_latency)=0.0013195587161539775
np.mean(gradless_latency)=0.009912540109980909
np.std(gradless_latency)=0.0005958322722224627

withgrad_memory=706749952
gradless_memory=512


In [16]:
print(f"{withgrad_latency=}")
print(f"{gradless_latency=}")

withgrad_latency=[0.018917412000519107, 0.013464051999108051, 0.009913267000229098, 0.010768768001071294, 0.011467056001492892, 0.009082956001293496, 0.011122882999188732, 0.010646086000633659, 0.009072086999367457, 0.01115228300113813, 0.009541612000248278, 0.010101908999786247, 0.011302853999950457, 0.009239147999323905, 0.011655709000478964, 0.011258713999268366, 0.009558992000165745, 0.011496747998535284, 0.010627176001435146, 0.009756256000400754, 0.01061187599952973, 0.008987033999801497, 0.01125229399985983, 0.00983872599863389, 0.009477190998950391, 0.010618095000609173, 0.009424881998711498, 0.010792778000904946, 0.009886857000310556, 0.00948845099992468, 0.011331584999425104, 0.0088791329999367, 0.010206269998889184, 0.009661004000008688, 0.009672153999417787, 0.01151277799908712, 0.00903125599870691, 0.010042007999800262, 0.010192319999987376, 0.009418179999556742, 0.011632959000053233, 0.009020655001222622, 0.010135259999515256, 0.010762488000182202, 0.009352781000416144, 0

In [17]:
print(prof_train.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_train.key_averages().table(sort_by="cuda_memory_usage", row_limit=10, top_level_events_only=True))

This report only display top-level ops statistics
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                forward         3.00%      27.124ms        23.67%     214.027ms     305.753us       0.000us         0.00%     572.635ms     818.050us     

In [18]:
print(prof_test.key_averages().table(sort_by="cuda_time_total", row_limit=10, top_level_events_only=True))
print(prof_test.key_averages().table(sort_by="cuda_memory_usage", row_limit=10, top_level_events_only=True))

This report only display top-level ops statistics
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                forward         3.10%      27.375ms        20.50%     181.183ms     258.833us       0.000us         0.00%     568.922ms     812.746us     

## Conclusion

It seems that the profiler does not take into account the memory that has to be kept for computing gradients. We think so because, with gradients enabled, the additional CUDA memory that stays allocated is about 485 MB for GVP and about 707 MB for the steerable model, whereas the profiler reports at most 250 MB and 630 MB (per iteration), respectively.
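
For reference, the byte counts and mean latencies printed above convert as follows. This is only arithmetic on the reported numbers (latencies rounded), and it assumes 1 MB = 10^6 bytes and 1 MiB = 2^20 bytes.

```python
# Values copied from the outputs above (latencies rounded to six decimals).
withgrad_memory = {"GVP": 485_300_736, "Steerable": 706_749_952}  # bytes
withgrad_latency = {"GVP": 0.028398, "Steerable": 0.010319}  # seconds, mean of 100 runs


def to_mb(n_bytes):
    """Convert bytes to megabytes (10**6 bytes)."""
    return n_bytes / 1e6


def to_mib(n_bytes):
    """Convert bytes to mebibytes (2**20 bytes)."""
    return n_bytes / 2**20


for name, n_bytes in withgrad_memory.items():
    print(f"{name}: {to_mb(n_bytes):.1f} MB = {to_mib(n_bytes):.1f} MiB")

# Ratio of the mean forward-pass latencies with gradients enabled.
latency_ratio = withgrad_latency["GVP"] / withgrad_latency["Steerable"]
print(f"Mean latency ratio GVP / Steerable: {latency_ratio:.2f}")
```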