# Benchmark for TPU using PyTorch

This notebook runs some computational tests to measure the performance a TPU can achieve. It is adapted from my previous PyTorch benchmarks. However, the setup script for pytorch-xla (the module that uses the TPU) only works with PyTorch 1.6 and not with the current version (1.7). The benchmark module from PyTorch (`torch.utils.benchmark`) is therefore not included and has to be replaced by the `timeit` module.

With the `timeit` module, executions can show a warm-up delay of approximately 2 us. In addition, I use my own timing method, which serves to analyse the timings reported by the benchmark.
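As a rough illustration of the two timing approaches mentioned above, the sketch below times the same function once with `timeit.Timer` and once with a hand-written loop around `time.perf_counter`, running an untimed warm-up call first. The `workload` function and both helper names are hypothetical stand-ins, not part of the benchmark itself.

```python
import time
import timeit


def workload(n=1000):
    """Hypothetical stand-in for the operation being benchmarked."""
    return sum(i * i for i in range(n))


def time_with_timeit(fn, number=5):
    """Average seconds per call as measured by timeit."""
    timer = timeit.Timer(fn)
    return timer.timeit(number) / number


def time_manually(fn, number=5, warmup=1):
    """Average seconds per call using a hand-written timing loop."""
    for _ in range(warmup):
        fn()  # warm-up calls are executed but not timed
    start = time.perf_counter()
    for _ in range(number):
        fn()
    return (time.perf_counter() - start) / number


if __name__ == "__main__":
    print(f"timeit: {time_with_timeit(workload) * 1e6:.1f} us per call")
    print(f"manual: {time_manually(workload) * 1e6:.1f} us per call")
```

Comparing the two figures on a CPU-only workload gives a feel for how much the first call and the measurement overhead contribute before moving on to the TPU.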

In [1]:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py  --apt-packages libomp5 libopenblas-dev # --version=pytorch-1.8

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5116  100  5116    0     0  29572      0 --:--:-- --:--:-- --:--:-- 29572
Updating... This may take around 2 minutes.
Updating TPU runtime to pytorch-dev20200515 ...
Found existing installation: torch 1.7.0
Uninstalling torch-1.7.0:
Done updating TPU runtime
  Successfully uninstalled torch-1.7.0
Found existing installation: torchvision 0.8.1
Uninstalling torchvision-0.8.1:
  Successfully uninstalled torchvision-0.8.1
Copying gs://tpu-pytorch/wheels/torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl...

Operation completed over 1 objects/91.0 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl...

Operation completed over 1 objects/119.5 MiB.                                    
Copying gs://tpu-pytorch/wheels/torchvision-nightly+202

In [2]:
# imports pytorch
import torch

print(torch.__version__)

# imports the torch_xla package
import torch_xla
import torch_xla.core.xla_model as xm
import platform
import os
# Standard-library timer used in place of torch.utils.benchmark
import timeit
# import torch.utils.benchmark as benchmark  # the benchmark library is in 1.7, which torch_xla is not compatible with

1.6.0a0+bf2bbd9


In [3]:
# Functions taken from the PyTorch web pages for PyTorch benchmarks
def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)

In [4]:
def benchMark(sizes, dev):
    number = 5  # executions per measurement
    for n in sizes:
        # Build the input on the CPU and move it to the target device
        x = torch.ones((n, n))
        x = x.to(device=dev)
        t0 = timeit.Timer(
            stmt='batched_dot_mul_sum(x, x)',
            setup='from __main__ import batched_dot_mul_sum',
            globals={'x': x})

        t1 = timeit.Timer(
            stmt='batched_dot_bmm(x, x)',
            setup='from __main__ import batched_dot_bmm',
            globals={'x': x})

        print('size of square matrix: ', n)
        # timeit() returns the total seconds for `number` runs; divide by `number` for the per-call time
        print(f'mul_sum(x, x):  {t0.timeit(number) / number * 1e6:>5.1f} us')
        print(f'bmm(x, x):      {t1.timeit(number) / number * 1e6:>5.1f} us\n')
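`timeit.Timer.timeit(number)` returns the total time for all `number` executions, so the per-call figure comes from dividing by that same `number`. A minimal sketch of this conversion, using `Timer.repeat` and reporting the fastest measurement; the helper name `per_call_us` and its parameters are hypothetical:

```python
import timeit


def per_call_us(stmt, number=5, repeat=3, namespace=None):
    """Return the best per-call time in microseconds.

    `stmt` is run `number` times per measurement, and the measurement
    is repeated `repeat` times; the fastest measurement is reported.
    """
    timer = timeit.Timer(stmt=stmt, globals=namespace)
    # Each entry is the total number of seconds for `number` executions
    totals = timer.repeat(repeat=repeat, number=number)
    return min(totals) / number * 1e6


if __name__ == "__main__":
    data = list(range(1000))
    best = per_call_us('sum(data)', namespace={'data': data})
    print(f"sum(data): {best:>5.1f} us")
```

Taking the minimum over several repeats is one common way to reduce the influence of background noise; the mean or median could be reported instead.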

In [5]:
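# `sizes` and `dev` are only defined in cell In [6]; running this cell before
# that one raises the NameError shown below.
for i in range(5):
    print("Benchmark execution: ", i + 1, "\n")
    benchMark(sizes, dev)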

Benchmark execution:  1 



NameError: name 'sizes' is not defined

In [6]:
# This block of code was used to debug a problem with FLOPs and timing in the PyTorch benchmark.
# I believe the PyTorch benchmark calculates the timings erroneously; they should be x10.
import sys
import time  # time.time() returns the time in seconds

dev = xm.xla_device()
sizes = [512, 1024, 2048, 4096, 8192, 16384]  # maximum size without running out of memory -> 32768

# First, batched_dot_mul_sum.
# Each measurement also covers creating the tensor on the CPU and moving it to the TPU.
for i in range(0,5):
    print("\nBenchmark execution for batched_dot_mul_sum: ",i+1, "\n")
    for n in sizes:
        timeInit = time.time()

        xCPU = torch.ones(n, n)
        xTPU = xCPU.to(device=dev)

        batched_dot_mul_sum(xTPU,xTPU)

        timeFinish = time.time()

        print(f"size matrix [{n}] -> {(timeFinish - timeInit):0.8f} s")


Benchmark execution for batched_dot_mul_sum:  1 

size matrix [512] -> 0.01723552 s
size matrix [1024] -> 0.03187752 s
size matrix [2048] -> 0.10471439 s
size matrix [4096] -> 0.46381640 s
size matrix [8192] -> 1.52552986 s
size matrix [16384] -> 5.70313144 s

Benchmark execution for batched_dot_mul_sum:  2 

size matrix [512] -> 0.04820681 s
size matrix [1024] -> 0.01482964 s
size matrix [2048] -> 0.05203986 s
size matrix [4096] -> 0.30052972 s
size matrix [8192] -> 1.18074107 s
size matrix [16384] -> 5.41844082 s

Benchmark execution for batched_dot_mul_sum:  3 

size matrix [512] -> 0.04722190 s
size matrix [1024] -> 0.01320314 s
size matrix [2048] -> 0.05595207 s
size matrix [4096] -> 0.29808450 s
size matrix [8192] -> 1.28567433 s
size matrix [16384] -> 4.82329321 s

Benchmark execution for batched_dot_mul_sum:  4 

size matrix [512] -> 0.04757476 s
size matrix [1024] -> 0.01040030 s
size matrix [2048] -> 0.03689814 s
size matrix [4096] -> 0.30336070 s
size matrix [8192] -> 1.271
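Because the figures vary from one execution to the next, it can help to condense them per matrix size. The sketch below does this for the `batched_dot_mul_sum` timings of runs 1 to 3 printed above (run 4 is left out because its output is cut off); the `summarise` helper is a hypothetical name, and minimum plus median is just one possible choice of summary.

```python
from statistics import median

# Elapsed seconds per matrix size, copied from runs 1-3 above
timings = {
    512: [0.01723552, 0.04820681, 0.04722190],
    1024: [0.03187752, 0.01482964, 0.01320314],
    2048: [0.10471439, 0.05203986, 0.05595207],
    4096: [0.46381640, 0.30052972, 0.29808450],
    8192: [1.52552986, 1.18074107, 1.28567433],
    16384: [5.70313144, 5.41844082, 4.82329321],
}


def summarise(runs):
    """Return (fastest, median) of a list of elapsed times in seconds."""
    return min(runs), median(runs)


if __name__ == "__main__":
    for n, runs in timings.items():
        best, mid = summarise(runs)
        print(f"size matrix [{n}] -> min {best:0.8f} s, median {mid:0.8f} s")
```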

In [7]:
#Now batched_dot_bmm
for i in range(0,5):
    print("\nBenchmark execution for batched_dot_bmm: ",i+1, "\n")
    for n in sizes:
        timeInit = time.time()

        xCPU = torch.ones(n, n)
        xTPU = xCPU.to(device=dev)
        batched_dot_bmm(xTPU,xTPU)
        
        timeFinish = time.time()

        print(f"size matrix [{n}] -> {(timeFinish - timeInit):0.8f} s")


Benchmark execution for batched_dot_bmm:  1 

size matrix [512] -> 0.04796410 s
size matrix [1024] -> 0.02043056 s
size matrix [2048] -> 0.04794312 s
size matrix [4096] -> 0.30063081 s
size matrix [8192] -> 1.27892423 s
size matrix [16384] -> 5.13432431 s

Benchmark execution for batched_dot_bmm:  2 

size matrix [512] -> 0.05016899 s
size matrix [1024] -> 0.01019764 s
size matrix [2048] -> 0.05459046 s
size matrix [4096] -> 0.30239129 s
size matrix [8192] -> 1.27758646 s
size matrix [16384] -> 4.99838758 s

Benchmark execution for batched_dot_bmm:  3 

size matrix [512] -> 0.04662132 s
size matrix [1024] -> 0.01001620 s
size matrix [2048] -> 0.05309319 s
size matrix [4096] -> 0.30087042 s
size matrix [8192] -> 1.26505852 s
size matrix [16384] -> 5.12098169 s

Benchmark execution for batched_dot_bmm:  4 

size matrix [512] -> 0.04772997 s
size matrix [1024] -> 0.01057267 s
size matrix [2048] -> 0.03583956 s
size matrix [4096] -> 0.36060619 s
size matrix [8192] -> 1.29745913 s
size mat

In [8]:
# Generate a .out file with the results.
# The PyTorch benchmark only prints to stdout, so stdout has to be redirected in order to write the results to a file.
#import sys

#original_stdout = sys.stdout # Save a reference to the original standard output

#with open('output_benchmark.out', 'w') as file:
#    sys.stdout = file # Change the standard output to the file we created.
#    #The benchmark execute 5 times to gather data and afterwards 
#    for i in range(0,5):
#        print("Benchmark execution: ",i+1, "\n")
#        benchMark(sizes,dev)

#sys.stdout = original_stdout # Reset the standard output to its original value

In [9]:
#Printing the results
#with open('output_benchmark.out', 'r') as file:
#    for line in file.readlines():
#        print(line)
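An alternative to swapping `sys.stdout` by hand is `contextlib.redirect_stdout` from the standard library, which restores the original stream automatically, even if the benchmark raises an exception. The sketch below assumes a hypothetical `run_benchmark` function standing in for `benchMark(sizes, dev)` and writes to a temporary directory rather than the notebook's working directory.

```python
import contextlib
import os
import tempfile


def run_benchmark(execution):
    """Hypothetical stand-in for benchMark(sizes, dev); it only prints."""
    print("Benchmark execution: ", execution, "\n")


def write_results(path, executions=5):
    """Run the benchmark with stdout redirected to the file at `path`."""
    with open(path, "w") as file, contextlib.redirect_stdout(file):
        for i in range(executions):
            run_benchmark(i + 1)


def read_results(path):
    """Return the lines of the results file without trailing newlines."""
    with open(path) as file:
        return [line.rstrip("\n") for line in file]


if __name__ == "__main__":
    out_path = os.path.join(tempfile.mkdtemp(), "output_benchmark.out")
    write_results(out_path)
    for line in read_results(out_path):
        print(line)
```

Stripping the trailing newline when reading avoids the doubled blank lines that `print(line)` would otherwise produce for each line read from the file.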
    