AhmetFurkanDEMIR/NVIDIA-GPU-benchmark

NVIDIA GPU benchmark

Hello! I ran two speed tests on the NVIDIA GPUs I have access to. The GPUs were accessed through Google Colab and AWS.

WARNING: Rather than judging these GPUs in isolation, I recommend evaluating each one together with the rest of its machine. The same GPU may give different results in other applications, in other tests, or at other times.

Graphics processing unit: The graphics processing unit, or GPU for short, is the device used to generate graphics in personal computers, workstations, and game consoles. Modern GPUs are extremely efficient at rendering and displaying computer graphics, and their highly parallel structure makes them more efficient than CPUs for algorithms that process large blocks of data in parallel. A GPU can sit on a dedicated graphics card or be integrated into the motherboard.

1. Speed test: I created four matrices with 10000 rows and 10000 columns on the GPU. In each iteration I multiply matrix a by b and assign the result to y, multiply matrix c by d and assign the result to z, and finally multiply y by z and assign the result to x. I repeated this 1000 times in total.

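import time, torch

# Start the clock ("bas" is Turkish for "start"). Note that the timed region
# also includes creating the four matrices below.
bas = time.time()

# Four 10000 x 10000 random float32 matrices, created directly on the default CUDA device
a = torch.rand(10000, 10000, device=torch.device("cuda"))
b = torch.rand(10000, 10000, device=torch.device("cuda"))
c = torch.rand(10000, 10000, device=torch.device("cuda"))
d = torch.rand(10000, 10000, device=torch.device("cuda"))


for i in range(0,1000):

  y = a@b  # first product
  z = c@d  # second product
  x = y@z  # product of the two results

# Stop the clock ("son" is Turkish for "end"). CUDA kernels are launched
# asynchronously and there is no torch.cuda.synchronize() before this line,
# so queued work may still be running when the time is read.
son = time.time()

print("1.test result (second) : " + str(son-bas))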
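To get a feel for the size of this workload, here is a rough back-of-the-envelope sketch in plain Python. It assumes the usual 2*n^3 convention for counting floating-point operations in a dense n x n matrix product and the default float32 dtype of torch.rand. The helper names (matmul_flops, matrix_bytes) are made up for this illustration.

```python
# Rough workload estimate for Test 1 (illustrative helper names, not part of the benchmark).
N = 10_000            # matrix dimension used in the test
ITERATIONS = 1_000    # loop count used in the test
BYTES_PER_FLOAT32 = 4 # torch.rand returns float32 by default


def matmul_flops(n: int) -> int:
    """FLOPs for one dense n x n product, counting n multiplies and n adds per output."""
    return 2 * n ** 3


def matrix_bytes(n: int) -> int:
    """Storage for one n x n float32 matrix, in bytes."""
    return n * n * BYTES_PER_FLOAT32


# Three products per iteration: a@b, c@d and y@z.
total_flops = 3 * matmul_flops(N) * ITERATIONS

# Seven matrices are alive at once: a, b, c, d plus the results y, z, x.
# The real peak can be higher, because a new result is allocated before the old one is freed.
min_footprint_bytes = 7 * matrix_bytes(N)

print(f"total work: {total_flops:.1e} FLOPs")
print(f"one matrix: {matrix_bytes(N) / 1e6:.0f} MB")
print(f"at least {min_footprint_bytes / 1e9:.1f} GB of GPU memory in use")
```

Under these assumptions the whole test is about 6e15 floating-point operations, and the seven matrices need a little under 3 GB, which fits on every GPU in this comparison.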

2. Speed test: In C++, I filled two matrices (10000 rows and 10000 columns each) with values using loops on the host, manually allocated space for them and for the result in GPU memory, and copied them to the GPU. Then I multiplied these matrices with each other.

You can download the CUDA source and header files written for Test 2 from these links: https://ahmetfurkandemir.s3.amazonaws.com/kernel.cu (kernel.cu), https://ahmetfurkandemir.s3.amazonaws.com/dev_array.h (dev_array.h), https://ahmetfurkandemir.s3.amazonaws.com/kernel.h (kernel.h), https://ahmetfurkandemir.s3.amazonaws.com/matrixmul.cu (matrixmul.cu).

#include <iostream>
#include <vector>
#include <stdlib.h>
#include <time.h>
#include <cuda_runtime.h>
#include "kernel.h"
#include "kernel.cu"
#include "dev_array.h"
#include <math.h>
#include <stdio.h>

using namespace std;

int main()
{
    // Perform matrix multiplication C = A*B
    // where A, B and C are NxN matrices
    int N = 10000;
    int SIZE = N*N;

    // Allocate memory on the host
    vector<float> h_A(SIZE);
    vector<float> h_B(SIZE);
    vector<float> h_C(SIZE);

    // Initialize matrices on the host
    for (int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            h_A[i*N+j] = sin(i);
            h_B[i*N+j] = cos(j);
        }
    }

    // Allocate memory on the device
    dev_array<float> d_A(SIZE);
    dev_array<float> d_B(SIZE);
    dev_array<float> d_C(SIZE);

    d_A.set(&h_A[0], SIZE);
    d_B.set(&h_B[0], SIZE);

    matrixMultiplication(d_A.getData(), d_B.getData(), d_C.getData(), N);
    cudaDeviceSynchronize();

    d_C.get(&h_C[0], SIZE);
    cudaDeviceSynchronize();

    printf("END");

    return 0;
}
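Because every row of A holds the constant sin(i) and every column of B holds cos(j), the product has a simple closed form: C[i][j] = N * sin(i) * cos(j). The sketch below checks this on a tiny matrix in plain Python, using the same row-major indexing (i*N+j) as the C++ program. It assumes that matrixMultiplication computes the ordinary dense product, since the kernel source is not reproduced here. The function names are invented for this illustration.

```python
import math


def reference_product(n: int) -> list:
    """Naive row-major n x n product of A[i][j] = sin(i) and B[i][j] = cos(j)."""
    a = [math.sin(i) for i in range(n) for _ in range(n)]  # a[i*n + j] = sin(i)
    b = [math.cos(j) for _ in range(n) for j in range(n)]  # b[i*n + j] = cos(j)
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            acc = 0.0
            for k in range(n):
                acc += a[i * n + k] * b[k * n + j]
            c[i * n + j] = acc
    return c


def expected_entry(n: int, i: int, j: int) -> float:
    """Closed form: the sum over k of sin(i) * cos(j) is n * sin(i) * cos(j)."""
    return n * math.sin(i) * math.cos(j)


n = 8  # tiny size so the pure-Python triple loop finishes instantly
c = reference_product(n)
worst = max(
    abs(c[i * n + j] - expected_entry(n, i, j))
    for i in range(n)
    for j in range(n)
)
print(f"largest deviation from the closed form: {worst:.2e}")
```

The same formula could be used to spot-check a few entries of h_C after the GPU run, with a looser tolerance because the GPU works in float32.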

Let's get to know the contestants :)

There are four GPU configurations in total, all from the Tesla series. Let's look at them in order.

4 X NVIDIA Tesla V100 GPU

  • As you can see, this machine has 4 Tesla V100 GPUs (64 GB of video memory in total) and a 16-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla P4 GPU

  • This machine has a Tesla P4 GPU with 7.6 GB of video memory and a 1-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla P100 GPU

  • This machine has a Tesla P100 GPU with 16.2 GB of video memory and a 1-core Intel(R) Xeon(R) CPU.

NVIDIA Tesla T4 GPU

  • This machine has a Tesla T4 GPU with 15 GB of video memory and a 1-core Intel(R) Xeon(R) CPU.

Test 1 Results

1. Let's recall the test. I created four matrices with 10000 rows and 10000 columns on the GPU. In each iteration I multiply matrix a by b and assign the result to y, multiply matrix c by d and assign the result to z, and finally multiply y by z and assign the result to x. I repeated this 1000 times in total.

Performance of GPUs, in seconds

  • 1-) 4 X NVIDIA Tesla V100 GPU : 291.4778277873993 seconds, about 4.86 minutes.

  • 2-) NVIDIA Tesla P4 GPU : 1071.427838563919 seconds, about 17.86 minutes.

  • 3-) NVIDIA Tesla P100 GPU : 479.9311819076538 seconds, about 8.00 minutes.

  • 4-) NVIDIA Tesla T4 GPU : 1293.739860534668 seconds, about 21.56 minutes.

Our machine with 4 X NVIDIA Tesla V100 GPU won this race. Note that the script places all tensors on the default CUDA device, so only one of the four V100s does the work.
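The reported times can be turned into relative slowdowns and an implied throughput with a few lines of Python. This is only a sketch: it reuses the 2*n^3 FLOP convention (an assumption, not something the benchmark measures) and treats the whole wall-clock time as compute time, even though the timer also covers matrix creation. The helper name implied_tflops is made up for this illustration.

```python
# Total work in Test 1: 3 products per iteration, 2 * N^3 FLOPs each, 1000 iterations.
TOTAL_FLOPS = 3 * 2 * 10_000 ** 3 * 1_000

# Wall-clock times reported above, in seconds.
test_1_seconds = {
    "4 X NVIDIA Tesla V100": 291.4778277873993,
    "NVIDIA Tesla P4": 1071.427838563919,
    "NVIDIA Tesla P100": 479.9311819076538,
    "NVIDIA Tesla T4": 1293.739860534668,
}


def implied_tflops(seconds: float) -> float:
    """Throughput the reported time would imply, in TFLOP/s."""
    return TOTAL_FLOPS / seconds / 1e12


fastest = min(test_1_seconds.values())
for name, seconds in sorted(test_1_seconds.items(), key=lambda item: item[1]):
    print(
        f"{name}: {implied_tflops(seconds):.1f} TFLOP/s implied, "
        f"{seconds / fastest:.2f}x the fastest time"
    )
```

If an implied figure comes out above a card's published FP32 peak, that is a hint that the timer stopped before all queued kernels had finished, which is possible here because the script never calls torch.cuda.synchronize() before reading the clock.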

Test 2 Results

2. Let's recall the test. In C++, I filled two matrices (10000 rows and 10000 columns each) with values using loops on the host, manually allocated space for them and for the result in GPU memory, and copied them to the GPU. Then I multiplied these matrices with each other.

Test-2a, Performance of GPUs, in seconds :

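Test 2-a: let's first see which machine compiles the CUDA file named matrixmul.cu fastest. The files are compiled with nvcc (the CUDA compiler). Compilation runs on the host CPU, so this step compares the machines rather than the GPUs themselves.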

# compilation test (notebook cell; "!" runs a shell command)
import time

bas = time.time()
!nvcc matrixmul.cu
son = time.time()

print("2.test-a result (second) : " + str(son-bas))

  • 1-) 4 X NVIDIA Tesla V100 GPU : 1.413379192352295 seconds.

  • 2-) NVIDIA Tesla P4 GPU : 2.9613592624664307 seconds.

  • 3-) NVIDIA Tesla P100 GPU : 1.4539947509765625 seconds.

  • 4-) NVIDIA Tesla T4 GPU : 1.6754465103149414 seconds.

The machine with 4 X NVIDIA Tesla V100 GPU won race 2-a by a small margin.
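Outside a notebook, the same kind of measurement can be taken with the standard library. The sketch below is one possible way to do it, not the method used for the numbers above: time_command is a hypothetical helper, and it uses time.perf_counter because that clock is intended for measuring short intervals. The demo call times a trivial Python process so that it runs anywhere.

```python
import subprocess
import sys
import time


def time_command(argv: list) -> tuple:
    """Run a command and return (elapsed seconds, exit code)."""
    start = time.perf_counter()
    completed = subprocess.run(argv, check=False)
    elapsed = time.perf_counter() - start
    return elapsed, completed.returncode


# Demo with a command that exists on every machine that can run this script.
elapsed, exit_code = time_command([sys.executable, "-c", "pass"])
print(f"exit code {exit_code}, {elapsed:.3f} seconds")
```

On a machine with the CUDA toolkit installed, passing ["nvcc", "matrixmul.cu"] would time the compile step of Test 2-a, and ["./a.out"] would time the run in Test 2-b. Checking the exit code helps avoid recording the time of a failed build.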

Test-2b, Performance of GPUs, in seconds :

Test 2-b: which machine finishes running the compiled file first?

# run the compiled file, test (notebook cell; "!" runs a shell command)
import time

bas = time.time()
!./a.out
son = time.time()

print("2.test-b result (second) : " + str(son-bas))

  • 1-) 4 X NVIDIA Tesla V100 GPU : 9.453376293182373 seconds.

  • 2-) NVIDIA Tesla P4 GPU : 8.686630487442017 seconds.

  • 3-) NVIDIA Tesla P100 GPU : 8.072553873062134 seconds.

  • 4-) NVIDIA Tesla T4 GPU : 8.99604868888855 seconds.

The machine with the NVIDIA Tesla P100 GPU won race 2-b by a small margin.
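One way to put "a small margin" into numbers is to compare how far apart the slowest and fastest machines are in each test. The sketch below computes that relative spread from the reported times. The helper name relative_spread is made up for this illustration.

```python
# Reported wall-clock times, in seconds.
test_1 = {
    "4 X V100": 291.4778277873993,
    "P4": 1071.427838563919,
    "P100": 479.9311819076538,
    "T4": 1293.739860534668,
}
test_2b = {
    "4 X V100": 9.453376293182373,
    "P4": 8.686630487442017,
    "P100": 8.072553873062134,
    "T4": 8.99604868888855,
}


def relative_spread(times: dict) -> float:
    """How much slower the slowest entry is than the fastest, as a fraction."""
    fastest = min(times.values())
    slowest = max(times.values())
    return (slowest - fastest) / fastest


print(f"Test 1 spread:   {relative_spread(test_1):.0%}")
print(f"Test 2-b spread: {relative_spread(test_2b):.0%}")
```

The gap in Test 2-b is far smaller than in Test 1. A plausible reading, though not something these numbers prove, is that the 2-b time is dominated by work every machine does at a similar speed: process startup, the host-side sin/cos initialization loops, and copying roughly 1.2 GB between host and device (two 400 MB inputs in, one 400 MB result out).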

My own conclusions based on these results

  • According to my observations, short and simple operations finish in very short and very similar times on all of these GPUs, regardless of video memory and CPU.

  • In long, compute-heavy calculations, however, a machine with more GPU memory and a better CPU stands out from its competitors.

  • Looking at the graphs and results, today's winner is the machine with 4 X NVIDIA Tesla V100 GPUs :).

  • WARNING: Rather than judging these GPUs in isolation, I recommend evaluating each one together with the rest of its machine. The same GPU may give different results in other applications, in other tests, or at other times.