## Performance Analysis of Concurrent Model Execution on Separate GPUs

**Reference:** [https://discuss.pytorch.org/t/assign-two-models-to-two-gpus-and-run-concurrently/150815/2](https://discuss.pytorch.org/t/assign-two-models-to-two-gpus-and-run-concurrently/150815/2)

**Experiment Setup:** 1000 forward passes per model; no backward pass or optimizer step.

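**Individual Model Execution Times** (these figures differ from the cell outputs further down, which appear to come from a separate run):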

*   **DenseNet-121:** Took ~35.73 seconds to process 1000 forward passes.
*   **ResNet-101:** Took ~62.80 seconds to process 1000 forward passes.
    *   ResNet-101 is about 1.76× slower than DenseNet-121 (62.80 s vs. 35.73 s).

**Total Parallel Execution Time:**

*   ~63.79 seconds, which is close to the time taken by the slower model (ResNet-101).
    *   This is because parallel execution completes when the longest task finishes.

**Comparison to Sequential Execution:**

*   If executed sequentially (one after another), the total time would be:
    ```
    35.73 s + 62.80 s ≈ 98.53 s
    ```
*   Running concurrently on separate GPUs reduced the execution time by ~35.3% (63.79 s vs. 98.53 s), a speedup of ~1.54×.
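
The comparison can be reproduced with a few lines of Python. This is a minimal sketch: the function name `concurrency_gain` is made up for illustration, and the inputs are the timings quoted in this section.

```python
def concurrency_gain(individual_times, concurrent_time):
    """Compare a concurrent wall-clock time against running the tasks back to back."""
    sequential = sum(individual_times)              # one task after another
    reduction = 1.0 - concurrent_time / sequential  # fraction of time saved
    speedup = sequential / concurrent_time
    return sequential, reduction, speedup


# Timings quoted above (seconds): DenseNet-121, ResNet-101, and the concurrent run.
sequential, reduction, speedup = concurrency_gain([35.73, 62.80], 63.79)
print(f"sequential: {sequential:.2f} s")
print(f"reduction:  {reduction:.1%}")
print(f"speedup:    {speedup:.2f}x")
```

With two tasks the speedup can approach 2× only when both take about the same time; here the slower model accounts for almost all of the concurrent run.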

**Performance Bottleneck:**

*   Wall-clock time is dictated by ResNet-101: `ThreadPoolExecutor` runs both models concurrently, and `wait()` returns only once the slower one has finished.
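
The same behaviour can be seen without a GPU. In this sketch, `time.sleep` stands in for the two forward-pass loops (an assumption purely for illustration; the durations are arbitrary), and the wall-clock time tracks the longer task rather than the sum.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_workload(seconds):
    """Stand-in for a forward-pass loop: block for the given duration."""
    start = time.perf_counter()
    time.sleep(seconds)
    return time.perf_counter() - start


durations = [0.2, 0.5]  # arbitrary stand-ins for the faster and slower model

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(fake_workload, d) for d in durations]
    done, not_done = wait(futures)  # returns once every future has finished
elapsed = time.perf_counter() - start

print(f"longest task: {max(durations):.1f} s, sum: {sum(durations):.1f} s")
print(f"wall-clock:   {elapsed:.2f} s")
```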

**Potential for Further Speedup:**

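*   Launching both models from a single thread, optionally on explicit CUDA streams, might reduce Python threading overhead. Kernels on different GPUs already execute independently, so streams mainly matter for overlapping work on the same device.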
*   Profiling could reveal whether GPU memory access or compute bottlenecks are causing the disparity in execution times.

In [1]:
import torch
import torch.nn as nn
import torchvision
from concurrent.futures import ThreadPoolExecutor, wait
import time

device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

start_time = time.time()

dense = torchvision.models.densenet121(pretrained=False).to(device_0)
rest = torchvision.models.resnet101(pretrained=False).to(device_1)

dummy1 = torch.ones((200,3,32,32)).to(device_0)
dummy2 = torch.ones((200,3,32,32)).to(device_1)

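def train(model,value,epoch):
    # Forward passes only; despite the name, no backward pass or optimizer step runs.
    ti = time.perf_counter()
    for _ in range(epoch):
        ret = model(value)
    if value.is_cuda:
        # CUDA kernels launch asynchronously; wait for them before stopping the timer.
        torch.cuda.synchronize(value.device)
    to = time.perf_counter()
    return ret.shape, to-ti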

output = []

with ThreadPoolExecutor() as executor:
    futures = []
    futures.append(executor.submit(train, dense, dummy1, 1000))
    futures.append(executor.submit(train, rest, dummy2, 1000))
    complete_futures, incomplete_futures = wait(futures)    # blocks until both tasks have completed
    for f in complete_futures:
        output.append(f.result())
        print(str(f.result()))

elapsed = (time.time() - start_time)
print(f"Total time of execution {round(elapsed, 4)} second(s)")
print("Output is:",output)



(torch.Size([200, 1000]), 37.140874870000005)
(torch.Size([200, 1000]), 63.63703838299995)
Total time of execution 64.9472 second(s)
Output is: [(torch.Size([200, 1000]), 37.140874870000005), (torch.Size([200, 1000]), 63.63703838299995)]


In [2]:
import torch
import torch.nn as nn
import torchvision
from concurrent.futures import ThreadPoolExecutor, wait
import time

device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
# device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

start_time = time.time()

dense = torchvision.models.densenet121(pretrained=False).to(device_0)
# rest = torchvision.models.resnet101(pretrained=False).to(device_0)

dummy1 = torch.ones((200,3,32,32)).to(device_0)
# dummy2 = torch.ones((200,3,32,32)).to(device_0)

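def train(model,value,epoch):
    # Forward passes only; despite the name, no backward pass or optimizer step runs.
    ti = time.perf_counter()
    for _ in range(epoch):
        ret = model(value)
    if value.is_cuda:
        # CUDA kernels launch asynchronously; wait for them before stopping the timer.
        torch.cuda.synchronize(value.device)
    to = time.perf_counter()
    return ret.shape, to-ti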

output = []


train(dense, dummy1, 1000)

elapsed = (time.time() - start_time)
print(f"Total time of execution {round(elapsed, 4)} second(s)")
# print("Output is:",output)

Total time of execution 24.8542 second(s)


In [5]:
import torch
import torch.nn as nn
import torchvision
from concurrent.futures import ThreadPoolExecutor, wait
import time

device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")  # used for ResNet-101 below

start_time = time.time()

# dense = torchvision.models.densenet121(pretrained=False).to(device_0)
rest = torchvision.models.resnet101(pretrained=False).to(device_1)

# dummy1 = torch.ones((200,3,32,32)).to(device_0)
dummy2 = torch.ones((200,3,32,32)).to(device_1)

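def train(model,value,epoch):
    # Forward passes only; despite the name, no backward pass or optimizer step runs.
    ti = time.perf_counter()
    for _ in range(epoch):
        ret = model(value)
    if value.is_cuda:
        # CUDA kernels launch asynchronously; wait for them before stopping the timer.
        torch.cuda.synchronize(value.device)
    to = time.perf_counter()
    return ret.shape, to-ti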

output = []

train(rest, dummy2, 1000)

elapsed = (time.time() - start_time)
print(f"Total time of execution {round(elapsed, 4)} second(s)")
print("Output is:",output)

Total time of execution 68.2755 second(s)
Output is: []


## Analysis of Profiling Results for **One Epoch**

The profiling output gives a per-operator breakdown of how DenseNet-121 and ResNet-101 spend their time on separate GPUs.

| Metric                | DenseNet-121 (GPU 0) | ResNet-101 (GPU 1) |
| --------------------- | -------------------- | ------------------- |
| Total CUDA Time       | 49.75 ms             | 52.48 ms            |
| Total CPU Time        | 50.39 ms             | 41.57 ms            |
| Batch Norm CUDA Time  | 7.62 ms              | 4.43 ms             |
| Convolution CUDA Time | 23.20 ms             | 44.73 ms            |
| Memory Usage          | 513.29 MB            | 500.89 MB           |

### 2️⃣ Why is ResNet-101 Slower?

**Heavy Convolution Overhead**

*   ResNet-101 spends 86.42% of CUDA time (44.73ms) on `aten::cudnn_convolution`, whereas DenseNet-121 only spends 65.33% (23.20ms).
*   This means that ResNet-101's deep convolutional layers are the biggest performance bottleneck.
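
As a cross-check, the convolution share can be recomputed from the table. The sketch below divides the convolution CUDA time by the total CUDA time for each model. The results come out lower than the percentages quoted above, which suggests those were taken against a different denominator (the source does not say which).

```python
# Values copied from the profiling table (milliseconds).
profile = {
    "DenseNet-121": {"total_cuda_ms": 49.75, "conv_cuda_ms": 23.20},
    "ResNet-101": {"total_cuda_ms": 52.48, "conv_cuda_ms": 44.73},
}

# Fraction of total CUDA time spent in convolution, per model.
conv_share = {
    name: row["conv_cuda_ms"] / row["total_cuda_ms"] for name, row in profile.items()
}

for name, share in conv_share.items():
    print(f"{name}: convolution is {share:.1%} of total CUDA time")
```

Either way, convolution dominates ResNet-101's CUDA time far more than DenseNet-121's.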

**SGEMM Kernel Operations**

*   ResNet-101 shows heavy CUDA activity in `volta_sgemm_64x64_nn` and `volta_sgemm_128x64_nn` kernels (Total: 28.2ms).
*   These are matrix-multiplication operations used in convolutions, which contribute significantly to processing time.

**Batch Normalization Impact**

*   DenseNet-121 spends 7.62ms on batch normalization, whereas ResNet-101 spends only 4.43ms.
*   This suggests DenseNet-121 has more frequent or larger batch norm operations, but they are not a major bottleneck.

**Memory Usage**

*   DenseNet-121: 513.29MB
*   ResNet-101: 500.89MB
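*   The memory footprints are comparable, so allocation size does not explain the speed difference. Combined with the convolution figures above, this points to compute cost as the main factor, although footprint alone does not rule out memory-bandwidth limits.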

### 3️⃣ How to Optimize Performance?

**✅ Use Mixed Precision (AMP) to Speed Up Convolutions**

*   ResNet-101 spends a lot of time in SGEMM operations and convolutions, which can be optimized using automatic mixed precision (AMP).
*   Try this in your training loop:

    ```python
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        output = model(input)
    ```
    *   This can reduce computational cost and improve speed with minimal loss of accuracy.

**✅ Use CUDA Streams for Better Parallelism**

*   Instead of using `ThreadPoolExecutor`, try CUDA streams to overlap computation:

    ```python
    stream0 = torch.cuda.Stream(device_0)
    stream1 = torch.cuda.Stream(device_1)

    with torch.cuda.stream(stream0):
        output1 = dense(dummy1)

    with torch.cuda.stream(stream1):
        output2 = rest(dummy2)
    ```
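    *   Kernel launches are asynchronous, so the host can enqueue work for both GPUs without waiting for either. Call `torch.cuda.synchronize()` on each device before reading a timer or the results.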
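
A single-threaded variant might look like the sketch below, which alternates forward passes so that both devices keep receiving work and synchronizes only once at the end. Several things here are assumptions for illustration: the helpers `pick_devices`, `sync`, `tiny_model`, and `run_interleaved` are made-up names, the small network is a stand-in for DenseNet-121 and ResNet-101, and the code falls back to one GPU or the CPU (where nothing overlaps) so that it runs anywhere. Each device has its own default stream, so explicit `torch.cuda.Stream` objects are left out.

```python
import time

import torch
import torch.nn as nn


def pick_devices():
    """Use two distinct GPUs when present; otherwise fall back so the sketch still runs."""
    gpus = torch.cuda.device_count() if torch.cuda.is_available() else 0
    if gpus >= 2:
        return torch.device("cuda:0"), torch.device("cuda:1")
    fallback = torch.device("cuda:0" if gpus == 1 else "cpu")
    return fallback, fallback


def sync(device):
    """Wait for queued GPU work; a no-op on CPU."""
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def tiny_model():
    """Small stand-in network so the sketch runs quickly (not DenseNet or ResNet)."""
    return nn.Sequential(
        nn.Conv2d(3, 8, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(8, 10),
    ).eval()


def run_interleaved(model_a, x_a, model_b, x_b, iters):
    """Alternate forward passes so both devices keep receiving work."""
    dev_a, dev_b = x_a.device, x_b.device
    sync(dev_a)
    sync(dev_b)
    start = time.perf_counter()
    with torch.no_grad():  # inference only, so skip autograd bookkeeping
        for _ in range(iters):
            out_a = model_a(x_a)  # on CUDA this enqueues kernels and returns early
            out_b = model_b(x_b)
    sync(dev_a)  # stop the timer only after both devices have finished
    sync(dev_b)
    return out_a.shape, out_b.shape, time.perf_counter() - start


dev_a, dev_b = pick_devices()
shape_a, shape_b, seconds = run_interleaved(
    tiny_model().to(dev_a), torch.ones(4, 3, 32, 32, device=dev_a),
    tiny_model().to(dev_b), torch.ones(4, 3, 32, 32, device=dev_b),
    iters=5,
)
print(shape_a, shape_b, f"{seconds:.4f} s")
```

Whether this beats the threaded version depends on how quickly the host can issue kernels for both models, so it is worth measuring rather than assuming.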

**✅ Increase Batch Size (If GPU Memory Allows)**

*   Since memory usage is similar (~500MB), try increasing batch size to improve GPU utilization.
*   Example:

    ```python
    dummy1 = torch.ones((400,3,32,32)).to(device_0)  # Increase batch size
    dummy2 = torch.ones((400,3,32,32)).to(device_1)
    ```
    *   If GPU memory allows, a larger batch amortizes fixed per-kernel launch overhead over more samples, which typically raises throughput.

### 4️⃣ Summary

*   ResNet-101 is slower mainly due to convolution overhead (44.73ms CUDA time vs. 23.20ms in DenseNet-121).
*   SGEMM operations (matrix multiplications) in ResNet-101 take significant time.
*   Memory usage is similar, so the gap appears to be driven by compute cost rather than memory footprint.
*   Optimization suggestions:
    *   Enable Mixed Precision (AMP) to optimize convolutions.
    *   Launch both models from one thread (optionally on CUDA streams), which may reduce Python threading overhead; the two GPUs already execute in parallel.
    *   Increase batch size to improve GPU utilization.

In [6]:
import torch
import torchvision
import time
import torch.profiler

# Assign devices
device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

# Load models
dense = torchvision.models.densenet121(pretrained=False).to(device_0)
rest = torchvision.models.resnet101(pretrained=False).to(device_1)

# Create dummy input tensors
dummy1 = torch.ones((200,3,32,32)).to(device_0)
dummy2 = torch.ones((200,3,32,32)).to(device_1)

# Profiling function for 1000 epochs
def profile_model(model, input_tensor, device, model_name, epochs=1000):
    print(f"\nProfiling {model_name} on {device} for {epochs} epochs...\n")

    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU, torch.profiler.ProfilerActivity.CUDA],
        schedule=torch.profiler.schedule(
            wait=5,   # Idle: the profiler is off for the first 5 steps of each cycle
            warmup=10,  # Warm-up: tracing runs for 10 steps, results are discarded
            active=20,  # Record the next 20 steps
            repeat=5  # Stop after 5 wait/warmup/active cycles
        ),
        record_shapes=True,
        with_stack=True,
        profile_memory=True
    ) as prof:
        torch.cuda.synchronize(device)  # Ensure all previous operations are done
        start_time = time.perf_counter()

        for epoch in range(epochs):
            with torch.profiler.record_function("Model Forward Pass"):
                _ = model(input_tensor)  # Run forward pass
            
            prof.step()  # Step the profiler at each iteration
        
        torch.cuda.synchronize(device)  # Ensure all GPU operations are finished
        end_time = time.perf_counter()
    
    execution_time = round(end_time - start_time, 4)
    print(f"{model_name} execution time for {epochs} epochs: {execution_time}s")

    # Print top profiling events
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    
    # Save profiling results
    prof.export_chrome_trace(f"{model_name}_profile.json")
    print(f"Profile saved: {model_name}_profile.json")

    return execution_time

# Profile DenseNet-121
dense_time = profile_model(dense, dummy1, device_0, "DenseNet-121", epochs=1000)

# Profile ResNet-101
rest_time = profile_model(rest, dummy2, device_1, "ResNet-101", epochs=1000)

# Print Summary
print("\n====== Execution Summary ======")
print(f"DenseNet-121 Execution Time for 1000 epochs: {dense_time}s")
print(f"ResNet-101 Execution Time for 1000 epochs: {rest_time}s")
print("Profiling complete! 🚀")


Profiling DenseNet-121 on cuda:0 for 1000 epochs...

DenseNet-121 execution time for 1000 epochs: 46.2665s
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                     Model Forward Pass         0.00%       0.000us         0.00%       0.000us       0.000us   
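
To see how much of the 1000-iteration loop the schedule in this cell actually records, the sketch below models the documented `torch.profiler.schedule` semantics in plain Python: each cycle idles for `wait` steps, warms up for `warmup` steps, then records `active` steps, and a non-zero `repeat` stops profiling after that many cycles. The helper `schedule_phase` is a simplified stand-in written for this note, not part of PyTorch.

```python
from collections import Counter


def schedule_phase(step, wait, warmup, active, repeat=0, skip_first=0):
    """Return 'idle', 'warmup', or 'record' for a 0-based step index."""
    if step < skip_first:
        return "idle"
    step -= skip_first
    cycle = wait + warmup + active
    if repeat > 0 and step // cycle >= repeat:
        return "idle"  # all requested cycles are finished
    position = step % cycle
    if position < wait:
        return "idle"
    if position < wait + warmup:
        return "warmup"
    return "record"


# Same settings as the profiling cell, applied to its 1000 iterations.
counts = Counter(
    schedule_phase(step, wait=5, warmup=10, active=20, repeat=5) for step in range(1000)
)
print(dict(counts))
```

Under these settings only the first 175 iterations fall inside a profiling cycle, so the reported averages describe a sample of the run rather than all 1000 iterations.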

### CUDA Stream

In [7]:
import torch
import torchvision
import time

# Assign devices
device_0 = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device_1 = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

# Load models
dense = torchvision.models.densenet121(pretrained=False).to(device_0)
rest = torchvision.models.resnet101(pretrained=False).to(device_1)

# Create dummy input tensors (Increased batch size for better GPU utilization)
dummy1 = torch.ones((400,3,32,32)).to(device_0)  # Increased batch size
dummy2 = torch.ones((400,3,32,32)).to(device_1)

# Create CUDA Streams for Parallel Execution
stream0 = torch.cuda.Stream(device_0)
stream1 = torch.cuda.Stream(device_1)

# Function to Run Model with Mixed Precision (AMP) for 1000 Epochs
def train_with_amp(model, input_tensor, stream, device, model_name, epochs=1000):
    torch.cuda.synchronize(device)  # Ensure all previous operations are done
    print(f"Running {model_name} on {device} for {epochs} epochs...")

    with torch.cuda.stream(stream):  # Assign computation to the stream
        with torch.autocast(device_type='cuda', dtype=torch.float16):  # Enable Mixed Precision
            start_time = time.perf_counter()
            for _ in range(epochs):  # Run for 1000 epochs
                output = model(input_tensor)  # Forward Pass
            torch.cuda.synchronize(device)  # Sync GPU operations
            end_time = time.perf_counter()
    
    execution_time = round(end_time - start_time, 4)
    print(f"{model_name} execution time for {epochs} epochs: {execution_time}s")
    return output.shape, execution_time
    
# Run the models one after the other: each call synchronizes before returning, so the two streams do not overlap
dense_result = train_with_amp(dense, dummy1, stream0, device_0, "DenseNet-121", epochs=1000)
rest_result = train_with_amp(rest, dummy2, stream1, device_1, "ResNet-101", epochs=1000)

# Print final results
print("\n====== Execution Summary ======")
print(f"DenseNet-121 Output Shape: {dense_result[0]}, Execution Time: {dense_result[1]}s")
print(f"ResNet-101 Output Shape: {rest_result[0]}, Execution Time: {rest_result[1]}s")
print("Execution complete! 🚀")

Running DenseNet-121 on cuda:0 for 1000 epochs...
DenseNet-121 execution time for 1000 epochs: 32.2444s
Running ResNet-101 on cuda:1 for 1000 epochs...
ResNet-101 execution time for 1000 epochs: 28.7406s

DenseNet-121 Output Shape: torch.Size([400, 1000]), Execution Time: 32.2444s
ResNet-101 Output Shape: torch.Size([400, 1000]), Execution Time: 28.7406s
Execution complete! 🚀
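
Because this cell doubles the batch size to 400 and enables AMP, its wall-clock times are not directly comparable with the earlier cells. Samples per second is a fairer yardstick. The sketch below uses only timings printed in this notebook (rounded to four decimals). The helper name `throughput` and the configuration labels are illustrative, and since the earlier threaded timings were taken without an explicit synchronize, the ratios should be read as rough.

```python
def throughput(batch_size, iterations, seconds):
    """Samples processed per second of wall-clock time."""
    return batch_size * iterations / seconds


# (batch size, iterations, seconds) from the threaded FP32 cell and from this cell.
runs = {
    "DenseNet-121": {
        "fp32_batch200": (200, 1000, 37.1409),
        "amp_batch400": (400, 1000, 32.2444),
    },
    "ResNet-101": {
        "fp32_batch200": (200, 1000, 63.6370),
        "amp_batch400": (400, 1000, 28.7406),
    },
}

gain = {}
for model, configs in runs.items():
    rates = {name: throughput(*args) for name, args in configs.items()}
    gain[model] = rates["amp_batch400"] / rates["fp32_batch200"]
    print(
        f"{model}: {rates['fp32_batch200']:.0f} -> {rates['amp_batch400']:.0f} "
        f"samples/s ({gain[model]:.2f}x)"
    )
```

The larger gain for ResNet-101 is consistent with the earlier observation that its time is dominated by convolutions, which are the operations mixed precision targets.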
