<div align="center"><h1> Hackathon HPC SENAI CIMATEC <br>
Landform Spatialization </h1></div>

**Murilo Boratto**$^1$

$^1$ Supercomputing Center SENAI CIMATEC, Salvador, Bahia, Brazil

## Installing the Libraries on Colab

This section walks through installing the required libraries in the Colab virtual environment, using their open-source implementations.

In [1]:
!sudo apt install mpich libopenblas-dev

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenblas-dev is already the newest version (0.3.20+ds-1).
The following additional packages will be installed:
  hwloc-nox libmpich-dev libmpich12 libslurm37
Suggested packages:
  mpich-doc
The following NEW packages will be installed:
  hwloc-nox libmpich-dev libmpich12 libslurm37 mpich
0 upgraded, 5 newly installed, 0 to remove and 49 not upgraded.
Need to get 14.2 MB of archives.
After this operation, 102 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libslurm37 amd64 21.08.5-2ubuntu1 [542 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy-updates/universe amd64 hwloc-nox amd64 2.7.0-2ubuntu1 [205 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 libmpich12 amd64 4.0-3 [5,866 kB]
Get:4 http://archive.ubuntu.com/ubuntu jammy/universe amd64 mpich amd64 4.0-3 [197 kB]
Get:5 http://archive.ubuntu.com/ubuntu jammy/universe 

In [57]:
!sudo apt-get install liblapack-dev


Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
liblapack-dev is already the newest version (3.10.0-2ubuntu1).
0 upgraded, 0 newly installed, 0 to remove and 49 not upgraded.


### 1 GPU (CUDA)

In [59]:
%%writefile r3d-1GPU.cu
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>  // CUDA Runtime API

#define Mxx 5026
#define Mxy 5026
#define INPUT "mde.pnz"
#define OUTPUT "output.txt"
#define ANOVA "anova.txt"

// Function prototypes
extern "C" {
    void dgesv_(int *N, int *NRHS, double *A, int *lda, int *ipiv, double *B, int *ldb, int *info);
}

long int n = 0;   // Global number of sample points
double zm = 0.0;  // Mean of z

void dados(double *x, double *y, double *z);
void matrizes(double *A, double *B, double *x, double *y, double *z, int N, int r);
void sistema_Lapack(double *A, double *B, int N);
void anova(double *A, double *x, double *y, double *z, int N, int r);

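int main(int argc, char **argv) {
    if (argc < 2) {  // The polynomial degree is a required argument
        fprintf(stderr, "Usage: %s <polynomial degree>\n", argv[0]);
        return 1;
    }
    int r = atoi(argv[1]);      // Degree of the polynomial in x
    int s = r;                  // Degree in y, kept equal to r
    int N = (r + 1) * (s + 1);  // Number of coefficients of the landform polynomial
    int MAX = (Mxx + 1) * (Mxy + 1);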

    double *A = (double*)malloc(sizeof(double) * N * N);
    double *B = (double*)malloc(sizeof(double) * N);
    double *x = (double*)malloc(sizeof(double) * MAX);
    double *y = (double*)malloc(sizeof(double) * MAX);
    double *z = (double*)malloc(sizeof(double) * MAX);

    double t1, t2;

    // Start the timer
    t1 = omp_get_wtime();

    // Processing stages
    dados(x, y, z);
    matrizes(A, B, x, y, z, N, r);
    sistema_Lapack(A, B, N);
    anova(B, x, y, z, N, r);

    // Stop the timer
    t2 = omp_get_wtime();

    printf("%d\t%5.2f\n", r, t2 - t1);

    // Free the allocated memory
    free(A);
    free(B);
    free(x);
    free(y);
    free(z);

    return 0;
}

void dados(double *x, double *y, double *z) {
    int col, row, count = -1;
    FILE *f;
    n = 0;  // Reset the global point counter

    if ((f = fopen(INPUT, "r")) == NULL) {
        printf("\n I/O error\n");
        exit(1);
    }

    for (row = 0; row < Mxy; ++row) {
        for (col = 0; col < Mxx; ++col) {
            float h;
            int result = fscanf(f, "%f", &h);
            count++;
            // Skip failed reads, keep one sample in ten, and drop non-positive heights
            if (result != 1 || count % 10 != 0 || h <= 0) continue;
            x[n] = row;
            y[n] = col;
            z[n] = h / 2863.0;
            n++;  // Advance to the next point
        }
    }

    // Compute the mean of z and store it in zm
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        sum += z[i];
    }
    zm = (n > 0) ? sum / n : 0.0;  // Avoid dividing by zero when no point was read

    fclose(f);
}

void matrizes(double *A, double *B, double *x, double *y, double *z, int N, int r) {
    int i, l, c;
    int s = r;

    for (l = 0; l < N; ++l) {
        for (c = 0; c < N; ++c) {
            A[l + c * N] = 0.0;

            if (c == 0)
                B[l] = 0.0;

            for (i = 0; i < n; ++i) {  // Loop over the n sample points
                A[l + c * N] += pow(x[i], (int)(l / (s + 1)) + (int)(c / (s + 1))) * pow(y[i], l % (r + 1) + c % (r + 1));
                if (c == 0)
                    B[l] += z[i] * pow(x[i], (int)(l / (s + 1))) * pow(y[i], l % (r + 1));
            }
        }
    }
}

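void sistema_Lapack(double *A, double *B, int N) {
    int NRHS = 1;
    int info;
    int *ipiv = (int*)malloc(sizeof(int) * N);  // dgesv needs N pivot indices
    dgesv_(&N, &NRHS, A, &N, ipiv, B, &N, &info);
    if (info != 0)  // Nonzero info signals an invalid argument or a singular matrix
        printf("\n dgesv failed, info = %d\n", info);
    free(ipiv);
}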

void anova(double *A, double *x, double *y, double *z, int N, int r) {
    int i, glReg, glR, glT, c, l;
    double SQReg, SQR, SQT, QMReg, QMR, R2, F, ze;
    FILE *f;
    int s = r;

    SQR = SQReg = 0.0;
    glReg = N;
    glR = n - 2 * N;
    glT = n - N;

    for (i = 0; i < n; ++i) {  // Loop over the n sample points
        ze = 0.0;
        // Coefficient k = c * (s + 1) + l multiplies x^c * y^l, the same
        // ordering matrizes() uses when it builds the system
        for (c = 0; c < r + 1; ++c)
            for (l = 0; l < s + 1; ++l)
                ze += A[c * (s + 1) + l] * pow(x[i], c) * pow(y[i], l);

        SQReg += (ze - zm) * (ze - zm);
        SQR += (z[i] - ze) * (z[i] - ze);
    }

    SQT = SQReg + SQR;
    QMReg = SQReg / glReg;
    QMR = SQR / glR;
    F = QMReg / QMR;
    R2 = SQReg / SQT;

    if ((f = fopen(ANOVA, "w")) == NULL) {
        printf("\n I/O error\n");
        exit(2);
    }

    fprintf(f, " \n\n\n\n");
    fprintf(f, " ANOVA\n");
    fprintf(f, " =================================================\n");
    fprintf(f, " FV           gl      SQ         QM          F    \n");
    fprintf(f, " =================================================\n");
    fprintf(f, " Regression  %5d  %12e  %12e  %12e\n", glReg, SQReg, QMReg, F);
    fprintf(f, " Residue     %5d  %12e  %12e      \n", glR, SQR, QMR);
    fprintf(f, " -------------------------------------------------\n");
    fprintf(f, " Total      %5d  %12e            \n", glT, SQT);
    fprintf(f, " =================================================\n");
    fprintf(f, " R^2= %12e                        \n", R2);

    for (c = 0; c < r + 1; ++c)
        for (l = 0; l < s + 1; ++l) {
            fprintf(f, "+x^%d*y^%d*\t%12g\n", c, l, A[c * (s + 1) + l]);
        }

    fclose(f);
}

Overwriting r3d-1GPU.cu
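
Read as a least-squares fit, the loops in `matrizes` and the call in `sistema_Lapack` appear to build and solve the normal equations of a bivariate polynomial model. The formulation below is a reconstruction from the code, not a statement taken from the original material; the symbols $\beta_k$ and $\phi_k$ are introduced here only for illustration, and $s = r$ as in the program.

$$
\hat{z}(x, y) = \sum_{k=0}^{N-1} \beta_k \, \phi_k(x, y), \qquad
\phi_k(x, y) = x^{\lfloor k/(r+1) \rfloor} \, y^{\,k \bmod (r+1)}, \qquad
N = (r+1)^2
$$

$$
A_{lc} = \sum_{i=0}^{n-1} \phi_l(x_i, y_i)\, \phi_c(x_i, y_i), \qquad
B_l = \sum_{i=0}^{n-1} z_i \, \phi_l(x_i, y_i), \qquad
A \beta = B
$$

Under this reading, `dgesv_` overwrites `B` with the coefficients $\beta$, which is why `main` passes `B` to `anova`. Building $A$ costs on the order of $n N^2$ power evaluations, so that loop nest is the natural target for parallelization.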


In [61]:
!nvcc r3d-1GPU.cu -o r3d-1GPU -Xcompiler="-fopenmp -O3 -lm" -lcublas -lcurand -lblas -llapack





In [62]:
!./r3d-1GPU 2 64


 n (number of operations for point in the matrix) = 1365248
2	860.28


## Experimental Analysis

### I) Validation with Small Values

#### Optimal Execution Parameters

1. OpenMP = **A** threads
2. MPI = **B** nodes + **C** processes
3. MPI + OpenMP = **D** nodes + **E** processes + **F** threads
4. CUDA = G1D B1DT1D (**G** * 32, 1024)

### Execution times of the applications, in seconds

| Degree | Sequential LAPACK | OpenMP | MPI | Hybrid | CUDA |
| ------ | ----------------- | ------ | --- | ------ | ---- |
| 2      |                   |        |     |        |      |
| 4      |                   |        |     |        |      |
| 6      |                   |        |     |        |      |
| 8      |                   |        |     |        |      |
| 10     |                   |        |     |        |      |

### Speedups

| Degree | OpenMP | MPI | Hybrid | CUDA |
| ------ | ------ | --- | ------ | ---- |
| 2      |        |     |        |      |
| 4      |        |     |        |      |
| 6      |        |     |        |      |
| 8      |        |     |        |      |
| 10     |        |     |        |      |

### II) Performance Analysis with a High Value - Polynomial Degree = `20`

#### Optimal Execution Parameters

1. OpenMP = **A** threads
2. MPI = **B** nodes + **C** processes
3. MPI + OpenMP = **D** nodes + **E** processes + **F** threads
4. CUDA = G1D B1DT1D (**G** * 32, 1024)

### Execution times of the applications, in seconds (reference: @muriloboratto)

| Degree | Sequential LAPACK | OpenMP | CUDA (1 GPU) |
| ------ | ----------------- | ------ | ------------ |
| 10     | 2060.14           | 62.55  | 14.72        |
| 20     | 27986.75          | 744.77 | 70.03        |


### Speedup (reference: @muriloboratto)

| Degree | OpenMP | CUDA (1 GPU) |
| ------ | ------ | ------------ |
| 10     | 32X    | 139X         |
| 20     | 37X    | 399X         |


## Conclusions

