# Benchmark for TPU using PyTorch

This notebook benchmarks the computational performance that a TPU can achieve. It is an adaptation of my previous PyTorch benchmarks. However, the script that sets up pytorch-xla (the module that gives access to the TPU) only works with PyTorch 1.6 and is not available for the current version (1.7). As a result, PyTorch's benchmark module is not included and has to be replaced by the timeit module.

Using the timeit module means that executions can show a warm-up delay of approximately 2 us. In addition, I use my own timing method, which serves to analyse the results obtained with timeit.
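As a rough illustration of how a warm-up delay can be kept out of a timeit measurement, the sketch below runs a few unmeasured calls first and then divides the total by the number of measured runs. The helper `time_per_call_us` and the toy function `work` are hypothetical names used only for this example; they are not part of the notebook.

```python
import timeit


def time_per_call_us(stmt, namespace, runs=5, warmup=1):
    """Return the mean time per call of `stmt` in microseconds."""
    timer = timeit.Timer(stmt=stmt, globals=namespace)
    if warmup:
        # Warm-up calls are executed but their time is discarded.
        timer.timeit(warmup)
    # timeit(runs) returns the TOTAL seconds for all runs.
    total_seconds = timer.timeit(runs)
    return total_seconds / runs * 1e6


def work(n):
    # Toy CPU workload standing in for the tensor operation.
    return sum(i * i for i in range(n))


mean_us = time_per_call_us("work(1000)", {"work": work}, runs=20, warmup=3)
print(f"work(1000): {mean_us:.1f} us per call")
```

Dividing by the same `runs` value that was passed to `timeit` keeps the reported figure a true per-call mean.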

In [1]:
!curl https://raw.githubusercontent.com/pytorch/xla/master/contrib/scripts/env-setup.py -o pytorch-xla-env-setup.py
!python pytorch-xla-env-setup.py  --apt-packages libomp5 libopenblas-dev # --version=pytorch-1.8
!pip install openpyxl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  5116  100  5116    0     0  63160      0 --:--:-- --:--:-- --:--:-- 63160
Updating... This may take around 2 minutes.
Updating TPU runtime to pytorch-dev20200515 ...
Found existing installation: torch 1.7.0
Uninstalling torch-1.7.0:
Done updating TPU runtime
  Successfully uninstalled torch-1.7.0
Found existing installation: torchvision 0.8.1
Uninstalling torchvision-0.8.1:
  Successfully uninstalled torchvision-0.8.1
Copying gs://tpu-pytorch/wheels/torch-nightly+20200515-cp37-cp37m-linux_x86_64.whl...

Operation completed over 1 objects/91.0 MiB.                                     
Copying gs://tpu-pytorch/wheels/torch_xla-nightly+20200515-cp37-cp37m-linux_x86_64.whl...

Operation completed over 1 objects/119.5 MiB.                                    
Copying gs://tpu-pytorch/wheels/torchvision-nightly+202

In [2]:
# imports pytorch
import torch

print(torch.__version__)

# imports the torch_xla package
import torch_xla
import torch_xla.core.xla_model as xm
import platform
import os
# Timing library used in place of the PyTorch benchmark module
import timeit
# import torch.utils.benchmark as benchmark  # not usable here: torch_xla is not compatible with PyTorch 1.7, which is where the benchmark library lives

1.6.0a0+bf2bbd9


In [3]:
# Functions taken from the PyTorch website's benchmark tutorial
def batched_dot_mul_sum(a, b):
    '''Computes batched dot by multiplying and summing'''
    return a.mul(b).sum(-1)


def batched_dot_bmm(a, b):
    '''Computes batched dot by reducing to bmm'''
    a = a.reshape(-1, 1, a.shape[-1])
    b = b.reshape(-1, b.shape[-1], 1)
    return torch.bmm(a, b).flatten(-3)
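To make explicit what both functions compute: for inputs of shape (batch, k), each returns one dot product per row. The sketch below is a pure-Python reference in which plain lists stand in for tensors; the names `rowwise_dot_mul_sum` and `rowwise_dot_bmm` are hypothetical and exist only for this illustration.

```python
def rowwise_dot_mul_sum(a, b):
    """Multiply element-wise, then sum along the last axis."""
    return [sum(x * y for x, y in zip(row_a, row_b))
            for row_a, row_b in zip(a, b)]


def rowwise_dot_bmm(a, b):
    """Treat each row pair as a (1 x k) @ (k x 1) matrix product."""
    out = []
    for row_a, row_b in zip(a, b):
        left = [row_a]                 # shape (1, k)
        right = [[v] for v in row_b]   # shape (k, 1)
        product = sum(left[0][j] * right[j][0] for j in range(len(row_a)))
        out.append(product)            # the single entry of the 1 x 1 result
    return out


ones = [[1.0] * 4 for _ in range(3)]
print(rowwise_dot_mul_sum(ones, ones))  # [4.0, 4.0, 4.0]
print(rowwise_dot_bmm(ones, ones))      # [4.0, 4.0, 4.0]
```

Both formulations give the same values; the benchmark compares how fast each one runs, not what it returns.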

In [4]:
def benchMark(sizes, dev):
    for n in sizes:
        # Build an n x n matrix of ones on the host and move it to the device
        x = torch.ones((n, n))
        x = x.to(device=dev)
        t0 = timeit.Timer(
            stmt='batched_dot_mul_sum(x, x)',
            setup='from __main__ import batched_dot_mul_sum',
            globals={'x': x})

        t1 = timeit.Timer(
            stmt='batched_dot_bmm(x, x)',
            setup='from __main__ import batched_dot_bmm',
            globals={'x': x})

        print('size of square matrix: ', n)
        # NOTE: timeit(5) returns the total seconds for 5 calls, yet the total is
        # divided by 100, so the printed value is 1/20 of the mean time per call.
        print(f'mul_sum(x, x):  {t0.timeit(5) / 100 * 1e6:>5.1f} us')
        print(f'bmm(x, x):      {t1.timeit(5) / 100 * 1e6:>5.1f} us\n')

In [5]:
dev = xm.xla_device()
sizes = [512, 1024, 2048, 4096, 8192, 16384]  # maximum size without running out of memory -> 32768

for i in range(0,5):
        print("Benchmark execution: ",i+1, "\n")
        benchMark(sizes,dev)

Benchmark execution:  1 

size of square matrix:  512
mul_sum(x, x):    7.7 us
bmm(x, x):        4.4 us

size of square matrix:  1024
mul_sum(x, x):    2.8 us
bmm(x, x):        2.5 us

size of square matrix:  2048
mul_sum(x, x):    8.0 us
bmm(x, x):        6.2 us

size of square matrix:  4096
mul_sum(x, x):    7.0 us
bmm(x, x):        5.0 us

size of square matrix:  8192
mul_sum(x, x):    9.3 us
bmm(x, x):        5.4 us

size of square matrix:  16384
mul_sum(x, x):   10.7 us
bmm(x, x):        4.3 us

Benchmark execution:  2 

size of square matrix:  512
mul_sum(x, x):    5.6 us
bmm(x, x):        3.3 us

size of square matrix:  1024
mul_sum(x, x):    6.4 us
bmm(x, x):        4.8 us

size of square matrix:  2048
mul_sum(x, x):    5.5 us
bmm(x, x):        3.5 us

size of square matrix:  4096
mul_sum(x, x):    6.2 us
bmm(x, x):        4.7 us

size of square matrix:  8192
mul_sum(x, x):    8.3 us
bmm(x, x):        6.1 us

size of square matrix:  16384
mul_sum(x, x):    1.4 us
bmm(x, x):    

In [6]:
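import time  # time.time() returns the time in seconds

def ownBenchmark(sizes, writerCSV, operation):
    dev = xm.xla_device()
    for i in range(5):
        print("\nBenchmark execution for ", operation, ": ", i + 1, "\n")
        for n in sizes:
            # The timed region covers the host allocation, the host-to-device
            # transfer and the call that issues the operation.
            timeInit = time.time()
            xCPU = torch.ones(n, n)
            xTPU = xCPU.to(device=dev)
            if operation == "mul_sum":
                batched_dot_mul_sum(xTPU, xTPU)
            else:
                batched_dot_bmm(xTPU, xTPU)
            timeFinish = time.time()
            print(f"size matrix [{n}] -> {(timeFinish - timeInit):0.8f} s")
            # Write through the writer passed as an argument, not a global one
            writerCSV.writerow([operation, n, i + 1, (timeFinish - timeInit)])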
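Because the benchmark times allocation, transfer and the operation as one interval, it can help to measure each phase separately. The sketch below shows one possible way to do that; the helper `timed` and its `sync` hook are hypothetical, and plain lists stand in for tensors so that the example runs without a TPU. On an XLA device, operations are recorded lazily, so a step that forces execution (for example copying the result back to the CPU) would presumably have to be passed as `sync`; that part is an assumption and is not tested here.

```python
import time


def timed(fn, sync=None):
    """Run fn(), optionally wait for the device, and return (result, seconds)."""
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # hypothetical hook: block until queued device work has finished
    return result, time.perf_counter() - start


# Stand-ins for the three phases. On the TPU these would be torch.ones(n, n),
# x.to(device=dev) and the batched dot call.
n = 256
host, t_create = timed(lambda: [[1.0] * n for _ in range(n)])
device, t_transfer = timed(lambda: [row[:] for row in host])
result, t_compute = timed(lambda: [sum(v * v for v in row) for row in device])

print(f"create {t_create:.6f} s, transfer {t_transfer:.6f} s, "
      f"compute {t_compute:.6f} s")
```

`time.perf_counter()` is used instead of `time.time()` because it is a monotonic clock intended for measuring short intervals.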

In [7]:
# Now my own benchmark, which I will use to measure speedups and efficiencies. The timeit-based results above look too good to be true...
# It uses Python's time library; no explicit synchronization with the device is performed
import time  # time.time() returns the time in seconds
import csv  # we generate a CSV with the results to work with in pandas

sizes = [512, 1024, 2048, 4096, 8192, 16384]  # maximum size without running out of memory -> 32768

with open('results_tpu.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["operation", "sizeMatrix", "numberCase","timeElpased"])
    ownBenchmark(sizes,writer,"mul_sum")
    ownBenchmark(sizes,writer,"bmm")


Benchmark execution for  mul_sum :  1 

size matrix [512] -> 0.00530362 s
size matrix [1024] -> 0.01515794 s
size matrix [2048] -> 0.04899144 s
size matrix [4096] -> 0.32499886 s
size matrix [8192] -> 1.30783510 s
size matrix [16384] -> 5.07148576 s

Benchmark execution for  mul_sum :  2 

size matrix [512] -> 0.04917693 s
size matrix [1024] -> 0.01009464 s
size matrix [2048] -> 0.04144573 s
size matrix [4096] -> 0.35329938 s
size matrix [8192] -> 1.37029552 s
size matrix [16384] -> 4.78269672 s

Benchmark execution for  mul_sum :  3 

size matrix [512] -> 0.05050397 s
size matrix [1024] -> 0.01043797 s
size matrix [2048] -> 0.05476999 s
size matrix [4096] -> 0.31031942 s
size matrix [8192] -> 1.27683711 s
size matrix [16384] -> 5.03634143 s

Benchmark execution for  mul_sum :  4 

size matrix [512] -> 0.05379534 s
size matrix [1024] -> 0.00998259 s
size matrix [2048] -> 0.04655671 s
size matrix [4096] -> 0.28967881 s
size matrix [8192] -> 1.26316905 s
size matrix [16384] -> 4.8074545

In [8]:
# Generate the Excel file and apply some light formatting
# TODO: include the FLOPS calculation in the Excel file/DataFrame
import pandas as pd

df = pd.read_csv("results_tpu.csv")
df.info()

df_sorted = df.sort_values(by=["operation","numberCase"])

df_sorted.to_excel("results_tpu_excel.xlsx")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 60 entries, 0 to 59
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   operation    60 non-null     object 
 1   sizeMatrix   60 non-null     int64  
 2   numberCase   60 non-null     int64  
 3   timeElpased  60 non-null     float64
dtypes: float64(1), int64(2), object(1)
memory usage: 2.0+ KB
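For the FLOPS calculation mentioned in the TODO, one possible starting point is sketched below. It assumes that a row-wise dot product over an n x n matrix costs n*n multiplications plus n*(n-1) additions; the helper `dot_flops` is a hypothetical name. The two sample rows are copied from the first mul_sum run above and use the same column layout as results_tpu.csv. Since the recorded times also include allocation and transfer, the resulting figures are only a lower bound on device throughput.

```python
import csv
import io


def dot_flops(n):
    """Approximate FLOPs for n row-wise dot products of length n (assumed model)."""
    return n * n + n * (n - 1)  # n*n multiplies plus n*(n-1) additions


# Rows copied from the first mul_sum run, in the results_tpu.csv layout.
sample = io.StringIO(
    "operation,sizeMatrix,numberCase,timeElpased\n"
    "mul_sum,512,1,0.00530362\n"
    "mul_sum,16384,1,5.07148576\n"
)

rows = []
for row in csv.DictReader(sample):
    n = int(row["sizeMatrix"])
    seconds = float(row["timeElpased"])
    rows.append((row["operation"], n, dot_flops(n) / seconds))

for operation, n, flops in rows:
    print(f"{operation} n={n}: {flops / 1e6:.1f} MFLOPS")
```

With pandas, the same idea could be applied to the DataFrame by adding a column computed from `sizeMatrix` and `timeElpased`.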
