FYI: I'm just trying to figure out how to solve the problem given in the lectures; these are just my thoughts on it, and I assume it may not be the correct/only answer.
The final project is to solve a problem given in the slides using CUDA programming. The problem is here:
Let's break this down step by step to understand what we actually need to do.
First things first, we need 3 histograms: DD, DR and RR.
According to Wikipedia, a histogram is an approximate representation of the distribution of numerical data, which looks like this:
Let's say there are 100 entries of data (in this case integers representing arrivals) in the collection. Each rectangle represents the frequency of data within a range; for example, the first rectangle to the right of 2 on the x/horizontal axis represents the number of arrivals between 2 and 2.5 minutes, and the heights of all the rectangles sum up to 100.
Here is some terminology:
- The width (on the x/horizontal axis) of each rectangle is called the bin width, marked with h, and the number of bins is marked with k. For data ranging from min x to max x:
$$k = \left\lceil \frac{\max x - \min x}{h} \right\rceil$$
- If we let n be the total number of observations and k be the total number of bins, the height m_i of rectangle i meets the following condition (a small worked example follows this list):
$$n = \sum_{i=1}^{k} m_i$$
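To make the binning concrete, here is a tiny host-side sketch (plain C, with a made-up data array and bin range, purely for illustration) of how each value maps to a bin index and how the bin heights m_i add up to n:
#include <stdio.h>

int main(void) {
    float data[] = { 2.1f, 2.3f, 2.4f, 3.0f, 3.7f, 4.2f };  // n = 6 observations
    int n = sizeof(data) / sizeof(data[0]);
    float h = 0.5f;      // bin width
    int k = 5;           // number of bins covering [2.0, 4.5)
    int m[5] = { 0 };    // m_i, the height of each rectangle

    for (int i = 0; i < n; i++) {
        int bin = (int)((data[i] - 2.0f) / h);  // map a value to its bin index
        m[bin]++;
    }

    int sum = 0;
    for (int i = 0; i < k; i++) sum += m[i];
    printf("sum of bin heights = %d, n = %d\n", sum, n);  // prints 6, 6
    return 0;
}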
In our case, the histogram DD stores frequencies of angles. Each one of these angles is the angle between 2 vectors, i.e. between 2 different galaxies from the real measured galaxy data source. These vectors are given in spherical coordinates; each of them has 2 components (right ascension and declination), according to the slides:
And for each pair of vectors we can get the cosine of their angle by dividing their dot product by the product of their lengths:
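The slide with the exact formula isn't reproduced here, but writing α for right ascension and δ for declination (both in radians), the dot product of the two unit direction vectors works out to the expression used in the code below:
$$\vec{v} = (\cos\delta\cos\alpha,\ \cos\delta\sin\alpha,\ \sin\delta), \qquad |\vec{v}_1| = |\vec{v}_2| = 1$$
$$\cos\theta = \vec{v}_1 \cdot \vec{v}_2 = \cos\delta_1\cos\delta_2(\cos\alpha_1\cos\alpha_2 + \sin\alpha_1\sin\alpha_2) + \sin\delta_1\sin\delta_2 = \sin\delta_1\sin\delta_2 + \cos\delta_1\cos\delta_2\cos(\alpha_1 - \alpha_2)$$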
Therefore I believe the function probably looks like this:
__device__ float calculateAngularDistance(float g1_ra, float g1_dec, float g2_ra, float g2_dec) {
    // convert arc minutes to radians (1 arc minute = 1/60 degree)
    float ra1  = g1_ra  / 60.0f * M_PI / 180.0f;
    float dec1 = g1_dec / 60.0f * M_PI / 180.0f;
    float ra2  = g2_ra  / 60.0f * M_PI / 180.0f;
    float dec2 = g2_dec / 60.0f * M_PI / 180.0f;
    // angular distance on the unit sphere
    float delta_ra = ra2 - ra1;
    float cos_c = sinf(dec1) * sinf(dec2) + cosf(dec1) * cosf(dec2) * cosf(delta_ra);
    // clamp to [-1, 1] so rounding errors cannot make acosf return NaN
    cos_c = fminf(1.0f, fmaxf(-1.0f, cos_c));
    // return the angle in degrees so it can be binned directly with 0.25-degree bins
    return acosf(cos_c) * 180.0f / M_PI;
}
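For reference, the unit conversion the function relies on (the input file gives coordinates in arc minutes, while sinf/cosf/acosf work in radians):
$$1\ \text{arcmin} = \frac{1}{60}^{\circ} = \frac{1}{60}\cdot\frac{\pi}{180}\ \text{rad} = \frac{\pi}{10800}\ \text{rad} \approx 2.909 \times 10^{-4}\ \text{rad}$$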
__global__ void calculateHistograms(float* d_ra_real, float* d_decl_real, float* r_ra_sim, float* r_decl_sim, int* dd, int* dr, int* rr, int numD, int numR) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;
    // grid-stride loop over the real galaxies: each thread handles several values of i
    for (int i = index; i < numD; i += stride) {
        // DD: real-real pairs, counting each pair only once by starting at j = i + 1
        for (int j = i + 1; j < numD; j++) {
            int bin = (int)( calculateAngularDistance(d_ra_real[i], d_decl_real[i], d_ra_real[j], d_decl_real[j]) / 0.25 );
            atomicAdd(&dd[bin], 1);
        }
        // DR: real-random pairs, every j is needed because the two sets are different
        for (int j = 0; j < numR; j++) {
            int bin = (int)( calculateAngularDistance(d_ra_real[i], d_decl_real[i], r_ra_sim[j], r_decl_sim[j]) / 0.25 );
            atomicAdd(&dr[bin], 1);
        }
    }
    // grid-stride loop over the random galaxies for RR, again counting each pair once
    for (int i = index; i < numR; i += stride) {
        for (int j = i + 1; j < numR; j++) {
            int bin = (int)( calculateAngularDistance(r_ra_sim[i], r_decl_sim[i], r_ra_sim[j], r_decl_sim[j]) / 0.25 );
            atomicAdd(&rr[bin], 1);
        }
    }
}
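One thing to pin down before allocating dd, dr and rr is their size: acosf() only returns angles between 0 and π (0° to 180°), so with the 0.25° bin width used above the histograms never need more than (this is my assumption; the slides may fix the exact bin count):
$$k = \frac{180^{\circ}}{0.25^{\circ}} = 720\ \text{bins}$$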
But the thing is, such a nested loop structure may not be optimal, because it doesn't use the calculated thread ID as the index for accessing the array entries. If we could somehow replace the outer loop with more threads, it should perform better. This brings up the following question: how many threads do we actually have available?
This is the device info of dione:
Found 4 CUDA devices
Device Tesla V100-PCIE-16GB device 0
compute capability = 7.0
totalGlobalMemory = 16.94 GB
l2CacheSize = 6291456 B
regsPerBlock = 65536
multiProcessorCount = 80
maxThreadsPerMultiprocessor = 2048
sharedMemPerBlock = 49152 B
warpSize = 32
clockRate = 1380.00 MHz
maxThreadsPerBlock = 1024
asyncEngineCount = 7
f to lf performance ratio = 2
maxGridSize = 2147483647 x 65535 x 65535
maxThreadsDim in thread block = 1024 x 1024 x 64
concurrentKernels = yes
deviceOverlap = 1
Concurrently copy memory/execute kernel
Device Tesla V100-PCIE-16GB device 1
compute capability = 7.0
totalGlobalMemory = 16.94 GB
l2CacheSize = 6291456 B
regsPerBlock = 65536
multiProcessorCount = 80
maxThreadsPerMultiprocessor = 2048
sharedMemPerBlock = 49152 B
warpSize = 32
clockRate = 1380.00 MHz
maxThreadsPerBlock = 1024
asyncEngineCount = 7
f to lf performance ratio = 2
maxGridSize = 2147483647 x 65535 x 65535
maxThreadsDim in thread block = 1024 x 1024 x 64
concurrentKernels = yes
deviceOverlap = 1
Concurrently copy memory/execute kernel
Device Tesla V100-PCIE-16GB device 2
compute capability = 7.0
totalGlobalMemory = 16.94 GB
l2CacheSize = 6291456 B
regsPerBlock = 65536
multiProcessorCount = 80
maxThreadsPerMultiprocessor = 2048
sharedMemPerBlock = 49152 B
warpSize = 32
clockRate = 1380.00 MHz
maxThreadsPerBlock = 1024
asyncEngineCount = 7
f to lf performance ratio = 2
maxGridSize = 2147483647 x 65535 x 65535
maxThreadsDim in thread block = 1024 x 1024 x 64
concurrentKernels = yes
deviceOverlap = 1
Concurrently copy memory/execute kernel
Device Tesla V100-PCIE-16GB device 3
compute capability = 7.0
totalGlobalMemory = 16.94 GB
l2CacheSize = 6291456 B
regsPerBlock = 65536
multiProcessorCount = 80
maxThreadsPerMultiprocessor = 2048
sharedMemPerBlock = 49152 B
warpSize = 32
clockRate = 1380.00 MHz
maxThreadsPerBlock = 1024
asyncEngineCount = 7
f to lf performance ratio = 2
maxGridSize = 2147483647 x 65535 x 65535
maxThreadsDim in thread block = 1024 x 1024 x 64
concurrentKernels = yes
deviceOverlap = 1
Concurrently copy memory/execute kernel
Using CUDA device 0
Assuming input data is given in arc minutes!
data_100k_arcmin.dat contains 100000 galaxies
There are 4 Tesla V100-PCIE-16GB GPUs; the maximum number of threads per block is 1024, and the shared memory per thread block is 49152 B (48 KB).
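A quick back-of-the-envelope check that one thread per galaxy fits comfortably on a single one of these devices (this is an upper bound; how many threads are actually resident at once also depends on occupancy):
$$80\ \text{SMs} \times 2048\ \text{threads per SM} = 163\,840 \ \ge\ N = 100\,000$$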
The input data contains N = 100000 real galaxy coordinates and N = 100000 randomly generated galaxy coordinates. Because we calculate the angle between every pair of coordinates within a set, for histogram DD or RR the sum over all bins should be the number of ways of choosing 2 out of N, i.e. C(N, 2) = N(N − 1)/2 = 100000 × 99999 / 2 = 4,999,950,000. (Note that this total doesn't fit in a 32-bit int, so it should be accumulated into a 64-bit variable when checking the result; the individual bin counts should still fit.)
Therefore I believe we can replace the outer loop with an if statement that checks whether the thread index is within the range of N = 100000 (or NoofReal / NoofSim / numD / numR, which is the same thing):
int threadsInBlock = 256;
int blocksInGrid = ( max( NoofReal, NoofSim ) + threadsInBlock - 1 ) / threadsInBlock;
calculateHistograms<<< blocksInGrid, threadsInBlock >>>( ra_real_gm, decl_real_gm, ra_sim_gm, decl_sim_gm, histogramDD_gm, histogramDR_gm, histogramRR_gm, NoofReal, NoofSim );
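With NoofReal = NoofSim = 100000 and 256 threads per block this launches ceil(100000 / 256) = 391 blocks, i.e. 391 × 256 = 100096 threads, which is why the kernel below still needs the index < numD and index < numR checks for the last 96 threads.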
And the function calculateHistograms() for calculating the histograms can be modified like this:
__global__ void calculateHistograms(float* d_ra_real, float* d_decl_real, float* r_ra_sim, float* r_decl_sim, int* dd, int* dr, int* rr, int numD, int numR) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    // each thread handles exactly one real galaxy: all DD and DR pairs starting from it
    if ( index < numD ) {
        // DD: real-real pairs, counting each pair only once by starting at j = index + 1
        for (int j = index + 1; j < numD; j++) {
            int bin = (int)( calculateAngularDistance(d_ra_real[index], d_decl_real[index], d_ra_real[j], d_decl_real[j]) / 0.25 );
            atomicAdd(&dd[bin], 1);
        }
        // DR: real-random pairs, every j is needed because the two sets are different
        for (int j = 0; j < numR; j++) {
            int bin = (int)( calculateAngularDistance(d_ra_real[index], d_decl_real[index], r_ra_sim[j], r_decl_sim[j]) / 0.25 );
            atomicAdd(&dr[bin], 1);
        }
    }
    // the same thread also handles one random galaxy for RR
    if ( index < numR ) {
        for (int j = index + 1; j < numR; j++) {
            int bin = (int)( calculateAngularDistance(r_ra_sim[index], r_decl_sim[index], r_ra_sim[j], r_decl_sim[j]) / 0.25 );
            atomicAdd(&rr[bin], 1);
        }
    }
}
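For completeness, here is a rough sketch of the host-side code I imagine around the launch above. This is my assumption rather than something given in the slides: h_ra_real, h_decl_real, h_ra_sim and h_decl_sim are hypothetical host arrays already filled from the input files, and NUM_BINS follows the 720-bin assumption from earlier.
#define NUM_BINS 720  // assuming 0.25-degree bins covering 0-180 degrees

float *ra_real_gm, *decl_real_gm, *ra_sim_gm, *decl_sim_gm;
int *histogramDD_gm, *histogramDR_gm, *histogramRR_gm;

// allocate device (global memory) buffers
cudaMalloc(&ra_real_gm, NoofReal * sizeof(float));
cudaMalloc(&decl_real_gm, NoofReal * sizeof(float));
cudaMalloc(&ra_sim_gm, NoofSim * sizeof(float));
cudaMalloc(&decl_sim_gm, NoofSim * sizeof(float));
cudaMalloc(&histogramDD_gm, NUM_BINS * sizeof(int));
cudaMalloc(&histogramDR_gm, NUM_BINS * sizeof(int));
cudaMalloc(&histogramRR_gm, NUM_BINS * sizeof(int));

// copy the input coordinates to the device and zero the histograms
cudaMemcpy(ra_real_gm, h_ra_real, NoofReal * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(decl_real_gm, h_decl_real, NoofReal * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(ra_sim_gm, h_ra_sim, NoofSim * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(decl_sim_gm, h_decl_sim, NoofSim * sizeof(float), cudaMemcpyHostToDevice);
cudaMemset(histogramDD_gm, 0, NUM_BINS * sizeof(int));
cudaMemset(histogramDR_gm, 0, NUM_BINS * sizeof(int));
cudaMemset(histogramRR_gm, 0, NUM_BINS * sizeof(int));

// launch with one thread per galaxy (same configuration as above)
int threadsInBlock = 256;
int blocksInGrid = ( max( NoofReal, NoofSim ) + threadsInBlock - 1 ) / threadsInBlock;
calculateHistograms<<< blocksInGrid, threadsInBlock >>>( ra_real_gm, decl_real_gm, ra_sim_gm, decl_sim_gm, histogramDD_gm, histogramDR_gm, histogramRR_gm, NoofReal, NoofSim );
cudaDeviceSynchronize();

// copy the three histograms back to the host
int histogramDD[NUM_BINS], histogramDR[NUM_BINS], histogramRR[NUM_BINS];
cudaMemcpy(histogramDD, histogramDD_gm, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(histogramDR, histogramDR_gm, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(histogramRR, histogramRR_gm, NUM_BINS * sizeof(int), cudaMemcpyDeviceToHost);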