
# Benchmarking GPUs in Kaggle

This Kaggle notebook adapts my CPU benchmark to the GPU using the PyTorch benchmark utilities. The main function (benchMark) has different input parameters: it now takes only the list of sizes to process. We are going to run it and analyse the results. The notebook also includes my own timing method, which is used to cross-check the timings reported by the PyTorch benchmark library.

## My CPU Benchmark adapted to GPU

PyTorch does not let us choose the CUDA block size; as far as I can tell it is hard-coded to 256 threads per block. So there is only one execution per matrix size.

Important: the GPU accelerator must be enabled to use this notebook. Otherwise the cells that use CUDA will fail.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

#Author: Raquel Ricoy

#Benchmark to study the potential of Kaggle's GPUs, CPUs and TPUs.
#It uses PyTorch and sets up a script to measure performance and GFLOPS.

#Install pytorch
#!conda install -y pytorch torchvision -c pytorch

import torch

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
#print(os.listdir("../input"))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

#Import the libraries needed for benchmarking with torch
import timeit
import torch.utils.benchmark as benchmark

In [2]:
#Functions taken from the PyTorch Benchmark page on the PyTorch website
def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)
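
Both functions compute the same thing: one dot product per row of the inputs. A minimal sketch that checks this on a tiny CPU tensor, assuming the two definitions above are copied unchanged (the input values are made up for illustration):

```python
import torch

def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

# Illustrative input: two rows, [0, 1, 2] and [3, 4, 5]
x = torch.arange(6, dtype=torch.float32).reshape(2, 3)

out_mul_sum = batched_dot_mul_sum(x, x)  # row-wise dot products: 5 and 50
out_bmm = batched_dot_bmm(x, x)

# Both implementations should agree up to floating-point tolerance
print(torch.allclose(out_mul_sum, out_bmm))
```

Since the results match, the benchmark compares two routes to the same answer, so any timing difference comes from the implementation and not from the amount of work.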

In [3]:
#Runs the benchmark and compares the batched dot implementations (mul/sum and bmm)
#Note: the threads per block cannot be changed from PyTorch on the GPU. The "1 threads" header in the benchmark output is the Timer default (num_threads=1), not a GPU setting.
def benchMark(sizes):
    results = []
    if len(sizes) == 0:
        print("Parameter 'sizes' must contain at least 1 element")
        return
    
    for n in sizes:
        # label and sub_label are the rows
        # description is the column
        label = 'Batched dot'
        sub_label = f'[{n}, {n}]'
        xCPU = torch.ones(n, n)
        xCUDA = xCPU.to(device="cuda:0")
        results.append(benchmark.Timer(
                stmt='batched_dot_mul_sum(x, x)',
                setup='from __main__ import batched_dot_mul_sum',
                globals={'x': xCUDA},
                label=label,
                sub_label=sub_label,
                description='mul/sum',
            ).blocked_autorange(min_run_time=1))
        results.append(benchmark.Timer(
                stmt='batched_dot_bmm(x, x)',
                setup='from __main__ import batched_dot_bmm',
                globals={'x': xCUDA},
                label=label,
                sub_label=sub_label,
                description='bmm',
            ).blocked_autorange(min_run_time=1))
    compare = benchmark.Compare(results)
    compare.print()
    return compare

In [4]:
#Check which GPU we are going to use
if torch.cuda.is_available(): 
    print("GPU is available")
    print("GPU device where we are gonna execute tests: ",torch.cuda.get_device_name())
else:
    print("GPU is NOT available")


GPU is available
GPU device where we are gonna execute tests:  Tesla P100-PCIE-16GB
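
The device name reports 16 GB of memory, which bounds the matrix sizes that can be benchmarked. A rough estimate of how much memory a dense [n, n] tensor needs, assuming float32 elements (the default dtype of torch.ones); tensor_gib is a hypothetical helper for illustration, not part of the notebook:

```python
BYTES_PER_FLOAT32 = 4  # assumption: default torch dtype (float32)

def tensor_gib(n, bytes_per_element=BYTES_PER_FLOAT32):
    """Size in GiB of a dense [n, n] tensor."""
    return n * n * bytes_per_element / 2**30

for n in [16384, 32768, 65536]:
    print(f"[{n}, {n}] -> {tensor_gib(n):.0f} GiB")
# [16384, 16384] -> 1 GiB
# [32768, 32768] -> 4 GiB
# [65536, 65536] -> 16 GiB
```

A [65536, 65536] input alone would take 16 GiB, and a.mul(b) also allocates a result of the same shape, which is consistent with that size running out of memory on this device.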


In [5]:
#The size limit is [65536,65536]: the GPU runs out of memory at that size
sizes = [512,1024,2048,4096,8192,16384,32768]
compares = []

#Run the benchmark 5 times to gather data and keep each result for later use
for i in range(0,5):
    print("Benchmark execution: ",i+1, "\n")
    compares.append(benchMark(sizes))

Benchmark execution:  1 

[-------------- Batched dot ---------------]
                      |  mul/sum  |    bmm  
1 threads: ---------------------------------
      [512, 512]      |     19.1  |     33.1
      [1024, 1024]    |     27.6  |     19.9
      [2048, 2048]    |     99.5  |     71.9
      [4096, 4096]    |    379.9  |    267.7
      [8192, 8192]    |   1454.5  |    993.5
      [16384, 16384]  |   5753.3  |   3851.8
      [32768, 32768]  |  22984.6  |  15215.4

Times are in microseconds (us).

Benchmark execution:  2 

[-------------- Batched dot ---------------]
                      |  mul/sum  |    bmm  
1 threads: ---------------------------------
      [512, 512]      |     19.2  |     19.8
      [1024, 1024]    |     27.7  |     20.0
      [2048, 2048]    |     99.5  |     72.1
      [4096, 4096]    |    380.1  |    267.8
      [8192, 8192]    |   1454.8  |    994.0
      [16384, 16384]  |   5754.0  |   3852.5
      [32768, 32768]  |  22986.4  |  15218.7

Times are in 
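
The times in the tables can be turned into an approximate GFLOPS figure. A minimal sketch, assuming each of the n dot products of length n costs about 2n floating-point operations (n multiplications and n - 1 additions); batched_dot_gflops is a hypothetical helper, and the times are copied from "Benchmark execution: 1":

```python
def batched_dot_gflops(n, time_us):
    """Approximate GFLOPS for n dot products of length n that took time_us microseconds."""
    flops = 2 * n * n            # about 2n operations per row, n rows
    seconds = time_us * 1e-6
    return flops / seconds / 1e9

# (mul/sum, bmm) times in microseconds, taken from the first benchmark table
measured_us = {
    8192: (1454.5, 993.5),
    16384: (5753.3, 3851.8),
    32768: (22984.6, 15215.4),
}

for n, (mul_sum_us, bmm_us) in measured_us.items():
    print(f"[{n}, {n}]  mul/sum: {batched_dot_gflops(n, mul_sum_us):6.1f} GFLOPS"
          f"  bmm: {batched_dot_gflops(n, bmm_us):6.1f} GFLOPS")
```

For [32768, 32768] this gives roughly 93 GFLOPS for mul/sum and 141 GFLOPS for bmm. A dot product does very little arithmetic per value read, so these figures probably reflect memory bandwidth more than the arithmetic peak of the GPU.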

In [6]:
#Own method
#We use Python's time library and synchronize with the GPU device before reading the clock
import time #-> time.time() returns the time in seconds

cuda0 = torch.device("cuda:0")

sizes = [512,1024,2048,4096,8192,16384,32768] # 65536 runs out of memory

#First batched_dot_mul_sum
for i in range(0,5):
    print("\nBenchmark execution for batched_dot_mul_sum: ",i+1, "\n")
    for n in sizes:
        #Total time: tensor creation + transfer to the GPU + computation
        timeInit = time.time()

        xCPU = torch.ones(n, n)
        xCUDA = xCPU.to(device=cuda0)

        #Computation only (measured but not printed below)
        timeInitMulSum = time.time()
        batched_dot_mul_sum(xCUDA,xCUDA)
        torch.cuda.synchronize()
        timeFinishMulSum = time.time()

        timeFinish = time.time()

        #The printed value is the total time, not the computation alone
        print(f"size matrix [{n}] -> {(timeFinish - timeInit):0.8f} s")


Benchmark execution for batched_dot_mul_sum:  1 

size matrix [512] -> 0.00163770 s
size matrix [1024] -> 0.00173306 s
size matrix [2048] -> 0.00610614 s
size matrix [4096] -> 0.04885530 s
size matrix [8192] -> 0.19165254 s
size matrix [16384] -> 0.75894785 s
size matrix [32768] -> 3.18522787 s

Benchmark execution for batched_dot_mul_sum:  2 

size matrix [512] -> 0.13693380 s
size matrix [1024] -> 0.00176907 s
size matrix [2048] -> 0.00612235 s
size matrix [4096] -> 0.04758263 s
size matrix [8192] -> 0.19416809 s
size matrix [16384] -> 0.76451230 s
size matrix [32768] -> 3.05518007 s

Benchmark execution for batched_dot_mul_sum:  3 

size matrix [512] -> 0.13375497 s
size matrix [1024] -> 0.00179148 s
size matrix [2048] -> 0.00639558 s
size matrix [4096] -> 0.04699349 s
size matrix [8192] -> 0.19169211 s
size matrix [16384] -> 0.76087332 s
size matrix [32768] -> 3.03962612 s

Benchmark execution for batched_dot_mul_sum:  4 

size matrix [512] -> 0.13376665 s
size matrix [1024] -> 0.

In [7]:
#Now batched_dot_bmm
for i in range(0,5):
    print("\nBenchmark execution for batched_dot_bmm: ",i+1, "\n")
    for n in sizes:
        #Total time: tensor creation + transfer to the GPU + computation
        timeInit = time.time()

        xCPU = torch.ones(n, n)
        xCUDA = xCPU.to(device=cuda0)
        batched_dot_bmm(xCUDA,xCUDA)
        torch.cuda.synchronize()
        
        timeFinish = time.time()

        print(f"size matrix [{n}] -> {(timeFinish - timeInit):0.8f} s")


Benchmark execution for batched_dot_bmm:  1 

size matrix [512] -> 0.13341093 s
size matrix [1024] -> 0.00225258 s
size matrix [2048] -> 0.00596809 s
size matrix [4096] -> 0.04708171 s
size matrix [8192] -> 0.19160604 s
size matrix [16384] -> 0.75696254 s
size matrix [32768] -> 3.18731189 s

Benchmark execution for batched_dot_bmm:  2 

size matrix [512] -> 0.13329959 s
size matrix [1024] -> 0.00174069 s
size matrix [2048] -> 0.00616002 s
size matrix [4096] -> 0.04675460 s
size matrix [8192] -> 0.18924212 s
size matrix [16384] -> 0.75205564 s
size matrix [32768] -> 3.01112890 s

Benchmark execution for batched_dot_bmm:  3 

size matrix [512] -> 0.14559960 s
size matrix [1024] -> 0.00174379 s
size matrix [2048] -> 0.00632596 s
size matrix [4096] -> 0.05021501 s
size matrix [8192] -> 0.19839668 s
size matrix [16384] -> 0.75647831 s
size matrix [32768] -> 3.02638221 s

Benchmark execution for batched_dot_bmm:  4 

size matrix [512] -> 0.13623810 s
size matrix [1024] -> 0.00173807 s
size 
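
The two loops above time the tensor creation on the CPU, the copy to the GPU and the computation together, which is likely the main reason their numbers are so much larger than the ones in the benchmark tables. A minimal sketch of how the three phases could be timed separately, assuming the same batched_dot_mul_sum function; time_phases is a hypothetical helper, not part of the notebook, and the script falls back to the CPU when no GPU is available:

```python
import time
import torch

def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)

def time_phases(n, device):
    """Return (create_s, transfer_s, compute_s) for an [n, n] tensor of ones."""
    def sync():
        # CUDA kernels run asynchronously, so wait for them before reading the clock
        if device.type == "cuda":
            torch.cuda.synchronize(device)

    t0 = time.perf_counter()
    x_cpu = torch.ones(n, n)           # phase 1: create the tensor on the CPU
    t1 = time.perf_counter()
    x_dev = x_cpu.to(device)           # phase 2: copy it to the target device
    sync()
    t2 = time.perf_counter()
    batched_dot_mul_sum(x_dev, x_dev)  # phase 3: the computation itself
    sync()
    t3 = time.perf_counter()
    return t1 - t0, t2 - t1, t3 - t2

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
time_phases(512, device)  # warm-up run so one-time initialization is not measured

for n in [512, 1024, 2048]:
    create_s, transfer_s, compute_s = time_phases(n, device)
    print(f"size matrix [{n}] -> create {create_s:0.8f} s, "
          f"transfer {transfer_s:0.8f} s, compute {compute_s:0.8f} s")
```

Only the compute figure is comparable with the times reported by the PyTorch benchmark, and a single run is still noisier than the averaged measurements the benchmark takes.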

In [8]:
#Optional: write the results to an output file (file.out)
#The PyTorch benchmark only prints to stdout, so we redirect stdout to a file.
#import sys

#original_stdout = sys.stdout # Save a reference to the original standard output

#with open('output_gpu_benchmark.out', 'w') as file:
#    sys.stdout = file # Change the standard output to the file we created.
#    i=1
#    for compare in compares:
#        print("Benchmark execution: ",i, "\n")
#        compare.print()
#        i += 1
#    sys.stdout = original_stdout # Reset the standard output to its original value
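
The same redirection can be written with contextlib.redirect_stdout from the standard library, which restores stdout automatically even if printing fails. A minimal sketch: save_compares and FakeCompare are hypothetical names for illustration, and FakeCompare only stands in for any object with a print() method, such as the results collected in compares:

```python
import contextlib
import os
import tempfile

class FakeCompare:
    """Hypothetical stand-in for a benchmark result: anything with a print() method."""
    def __init__(self, text):
        self.text = text

    def print(self):
        print(self.text)

def save_compares(compares, path):
    """Write every result in compares to the file at path instead of the screen."""
    with open(path, "w") as out_file, contextlib.redirect_stdout(out_file):
        for i, compare in enumerate(compares, start=1):
            print("Benchmark execution: ", i, "\n")
            compare.print()
    # stdout is back to normal here

# Demonstration in a temporary directory with made-up table contents
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "output_gpu_benchmark.out")
    save_compares([FakeCompare("table 1"), FakeCompare("table 2")], path)
    with open(path) as saved_file:
        saved = saved_file.read()

print(saved)
```

In the notebook, the call would presumably be save_compares(compares, 'output_gpu_benchmark.out') after the benchmark cell has run.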