multi-node SA/UA/RSAMG (#212)

* PMIS MPI support (#100)
* mpi to hip backend
* MPI enablement for PMIS
* MPI RS example
* clang-format
* example
* fix
* global Ext+I interpolation (#112)
* unused function
* Added PM as argument for aggregation function
* added pm for each level to MG class
* Added PM as argument for aggregation function #2
* Added PM as argument for aggregation function #3
* RS Ext+I added to global matrix
* modified example to dump some data for validation - work in progress
* RS Ext+I function added to headers
* RS Ext+I HIP implementation
* RS Ext+I host implementation
* Improved PM
* global Ext+I kernel update
* some multinode improvements (#118)
* added some more useful guards to parallel manager
* added CopyFromHostData, CopyToHostData, ExclusiveSum and Sort functionality for vector class ; moved boundary information from vector to matrix ; added GetFormat() to GlobalMatrix class
* clang-format
* P and R should be OperatorType, not LocalMatrix
* clang-format
* check if PM is valid when required
* added extra function for triple matrix product for simplicity
* clang-format
* clang-format
* skip free when ptr is nullptr
* fix memory leak in BaseAMG class
* rocsparse_csrgeam added
* renamed csr ext+i kernel because it is generally usable
* allowing CSR zero matrices with row_offset != nullptr, as well as zero vectors with size == 0
* clang-format
* duplicated row column entries throw a warning
* copy_x2x() functions added for readability and simplicity
* search and replace memcpy with copy fct
* search and replace memcpy with copy fct #2 ; fix for random csr generator to not generate duplicated row col entries
* clang-format
* fixes
* those asserts are wrong
* major version bump
* OpenMP parallel loop threshold need to be int64_t in order to work with larger structures
* Allow basic structures with 64bit entries, e.g. for global indices
* nnz should be 64bit ; also restructured RSAMG for readability
* vector size need to be 64bit locally - also added inclusive and exclusive sum functionality
* 64bit sizes for stencils
* host vector implementation changes for 64bit sizes and in/exclusive sum ; host I/O changed to always write 64bit sizes
* host stencil 64bit changes
* max residual index changed to 64bit accordingly
* solvers adjusted for 64bit nnz
* int64_t to double conversion
* allocation size should always be 64bit ; also added copy_h2h() for simplicity
* long and long long communication support added
* cleaned up types ; IndexType2 was a stupid name anyway
* removed deprecations (major release); enabled global structure support in RSAMG
* major changes to PM; added guards for transfers; removed deprecations; fixed int overflows; functionality to generate a PM from global ghost column ids, and a parent PM
* matrix conversions 64bit nnz support with guards
* host matrix I/O changed to always write 64bit sizes ; backward compatible
* host matrix implementations changed to 64bit nnz
* RSAMG restructured - global communication should not happen in local implementations ; switched to 64bit sizes
* host CSR matrix implementation
* hip implementation ; added copy_d2h/h2d/d2d for simplicity, with async flag
* adjusted unit tests to removed deprecated functions
* RSAMG MPI example updated
* fixed sanity assert
* doc update
* example should work with only 1 process
* global routines should work with single process
* global transpose operator
* using copy_h2h()
* _rocalution_sync should force a global barrier, too
* improved asynchronous apply / comm / halo apply
* accelerator must be available for pinned alloc/free
* fixing few compiler warnings
* readability
* removed the flood of printf on multi gpu systems
* adjusted openmp nested (deprecation) to v5.0
* weak scaling examples
* distributed laplacian generator
* updated rsamg example
* updated rsamg mpi example
* should use OperatorType, nothing else
* fixed RSDirectInterpolation(); fixed const PM issue
* updated unit tests
* types.hpp generated by cmake ; CSR(64/32) added on host ; moved RSPMIS communication into global matrix class
* removed old types.hpp
* initial implementation for unordered set and map on hip backend
* outsourced RSAMG to improve compilation performance; added async communication for multinode; moved multinode rspmis into globalmatrix; outsourced atomics
* clang-format
* fix for streams when not building for mpi
* SA amg merge fix
* fixed missing shared memory size
* clang-format
* clang-format
* typo
* clang format
* add blockdim to UAAMG benchmark
* adjusting unit tests for removed deprecated functions
* clang format
* test fix
* clang-format
* std::sort required algorithm header
* fixing merge error
* merge fix #2
* header cleaned up
* header cleaned up #2
* fix issue with HIP not being found
* free_pinned() does nothing on nullptr
* global triple matrix product
* proper error message when coarsening fails
* fixed a bug in global triplematrixproduct
* fixed a typo
* fixed compilation issue when HIP=off
* fixes COO and CSR conversions on both host and device, and ELL on host only (#211)
* empty matrix conversion fix
* host fallback fix for rsamg and triplematprod
* Fix documentation failures (#214)
Co-authored-by: jsandham <james.sandham@amd.com>
* Add Smoothed Aggregation to amgmpi branch (#213)
* Adding global aggregation to SAAMG (#166)
Co-authored-by: jsandham <james.sandham@amd.com>
* Add MPI support for global prolongation to SAAMG (#171)
Co-authored-by: jsandham <james.sandham@amd.com>
* Add MPI support for SAAMG global transpose (#172)
* Add MPI support for SAAMG global transpose
* Fix failures in greedy aggregation caused by unfilled aggregate_root_nodes array
---------
Co-authored-by: jsandham <james.sandham@amd.com>
* Add MPI unsmoothed aggregation (#174)
Co-authored-by: jsandham <james.sandham@amd.com>
* Adding debug printing to test triple product
* Adding debug print statements for testing
* Adding more debug printing
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Fix floating point fault caused by division by zero
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Testing
* Fix failures in local matrix when max_nnz_per_row is too high
* Testing
* Fix bug where we were not using a large enough hash table size
* Fix discrepency in host and hip assert in ExtractSubMatrix
* Fixing hangs in multinode hip backend
* Fix RSAMG documentation warnings
* Testing MPI uaamg
* Fix testing_local_matrix failure
* Remove comments and temporary testing code
* PR fixes
* PR fixes
* PR fixes
* Clang formatting
---------
Co-authored-by: jsandham <james.sandham@amd.com>
* removed unused variables
* Add back functions that cannot be removed until next major release (#216)
Co-authored-by: jsandham <james.sandham@amd.com>
* fix for very large sizes where local ext matrix exceeds int32
* Remove print statements from saamg testing file
---------
Co-authored-by: James Sandham <33790278+jsandham@users.noreply.github.com>
Co-authored-by: jsandham <james.sandham@amd.com>
ROCm · Nov 21, 2023 · 5a91521
1 parent 045b889

Showing 62 changed files with 14,167 additions and 1,423 deletions.
diff --git a/.githooks/pre-commit b/.githooks/pre-commit
@@ -35,7 +35,7 @@ fi
 for file in $files; do
     if [[ -e $file ]]; then
         /usr/bin/perl -pi -e 'INIT { exit 1 if !-f $ARGV[0] || -B $ARGV[0]; $year = (localtime)[5] + 1900 }
-            s/^([*\/#[:space:]]*)Copyright\s+(?:\(C\)\s*)?(\d+)(?:\s*-\s*\d+)?/qq($1Copyright (c) $2@{[$year != $2 ? "-$year" : ""]})/ie
+            s/^([*\/#[:space:]]*)Copyright\s+(?:\(C\)\s*)?(\d+)(?:\s*-\s*\d+)?/qq($1Copyright (C) $2@{[$year != $2 ? "-$year" : ""]})/ie
             if $. < 10' "$file" && git add -u "$file"
     fi
 done

diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -1,5 +1,5 @@
 # ########################################################################
-# Copyright (c) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
+# Copyright (C) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
 #
 # Permission is hereby granted, free of charge, to any person obtaining a copy
 # of this software and associated documentation files (the "Software"), to deal
@@ -37,7 +37,6 @@ list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH}/llvm ${ROCM_PATH})
 # CMake modules
 list(APPEND CMAKE_MODULE_PATH
      ${CMAKE_CURRENT_SOURCE_DIR}/cmake
-     ${ROCM_PATH}/lib/cmake/hip
      ${ROCM_PATH}/hip/cmake)
 
 # Set a default build type if none was specified

diff --git a/clients/include/testing_local_matrix.hpp b/clients/include/testing_local_matrix.hpp
@@ -41,10 +41,12 @@ void testing_local_matrix_bad_args(void)
     set_device_rocalution(device);
     init_rocalution();
 
-    LocalMatrix<T>   mat1;
-    LocalMatrix<T>   mat2;
-    LocalVector<T>   vec1;
-    LocalVector<int> int1;
+    LocalMatrix<T>       mat1;
+    LocalMatrix<T>       mat2;
+    LocalVector<T>       vec1;
+    LocalVector<bool>    bool1;
+    LocalVector<int>     int1;
+    LocalVector<int64_t> int641;
 
     // null pointers
     int* null_int  = nullptr;
@@ -193,16 +195,28 @@ void testing_local_matrix_bad_args(void)
     }
 
     // AMG
+    {
+        int                   val;
+        LocalVector<bool>*    bool_null_vec  = nullptr;
+        LocalVector<int64_t>* int64_null_vec = nullptr;
+        ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, bool_null_vec, &int641, &int641),
+                     ".*Assertion.*connections != (NULL|__null)*");
+        ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, &bool1, int64_null_vec, &int641),
+                     ".*Assertion.*aggregates != (NULL|__null)*");
+        ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, &bool1, &int641, int64_null_vec),
+                     ".*Assertion.*aggregate_root_nodes != (NULL|__null)*");
+
+        LocalMatrix<T>* null_mat = nullptr;
+        ASSERT_DEATH(mat1.AMGSmoothedAggregation(0.1, bool1, int641, int641, null_mat),
+                     ".*Assertion.*prolong != (NULL|__null)*");
+    }
+
     {
         int               val;
         LocalVector<int>* null_vec = nullptr;
         LocalMatrix<T>*   null_mat = nullptr;
-        ASSERT_DEATH(mat1.AMGConnect(0.1, null_vec), ".*Assertion.*connections != (NULL|__null)*");
-        ASSERT_DEATH(mat1.AMGAggregate(int1, null_vec),
-                     ".*Assertion.*aggregates != (NULL|__null)*");
-        ASSERT_DEATH(mat1.AMGSmoothedAggregation(0.1, int1, int1, null_mat),
+        ASSERT_DEATH(mat1.AMGUnsmoothedAggregation(int641, int641, null_mat),
                      ".*Assertion.*prolong != (NULL|__null)*");
-        ASSERT_DEATH(mat1.AMGAggregation(int1, null_mat), ".*Assertion.*prolong != (NULL|__null)*");
         ASSERT_DEATH(mat1.InitialPairwiseAggregation(0.1, val, null_vec, val, &null_int, val, 0),
                      ".*Assertion.*G != (NULL|__null)*");
         ASSERT_DEATH(mat1.InitialPairwiseAggregation(0.1, val, &int1, val, &vint, val, 0),

diff --git a/clients/include/testing_saamg.hpp b/clients/include/testing_saamg.hpp
@@ -1,5 +1,5 @@
 /* ************************************************************************
- * Copyright (C) 2018-2022 Advanced Micro Devices, Inc. All rights Reserved.
+ * Copyright (C) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
  *
  * Permission is hereby granted, free of charge, to any person obtaining a copy
  * of this software and associated documentation files (the "Software"), to deal
@@ -125,10 +125,6 @@ bool testing_saamg(Arguments argus)
     // Solver
     FCG<LocalMatrix<T>, LocalVector<T>, T> ls;
 
-    // Start time measurement
-    double tick = rocalution_time();
-    double tack = rocalution_time();
-
     // AMG
     SAAMG<LocalMatrix<T>, LocalVector<T>, T> p;
 
@@ -157,12 +153,6 @@ bool testing_saamg(Arguments argus)
     p.SetInterpRelax(2.0 / 3.0);
     p.BuildHierarchy();
 
-    // Stop build hierarchy time measurement
-    tack = rocalution_time();
-    std::cout << "Build Hierarchy took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-    // Start smoother time measurement
-    tick = rocalution_time();
-
     // Get number of hierarchy levels
     int levels = p.GetNumLevels();
 
@@ -206,11 +196,6 @@ bool testing_saamg(Arguments argus)
 
     ls.Init(1e-8, 0.0, 1e+8, 10000);
 
-    // Stop build smoother time measurement
-    tack = rocalution_time();
-    std::cout << "Smoother build took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-    // Start build time measurement
-    tick = rocalution_time();
     ls.Build();
 
     if(rebuildnumeric)
@@ -227,18 +212,8 @@ bool testing_saamg(Arguments argus)
     // Matrix format
     A.ConvertTo(format, format == BCSR ? argus.blockdim : 1);
 
-    // Stop building time measurement
-    tack = rocalution_time();
-    std::cout << "Build took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-    // Start solving time measurement
-    tick = rocalution_time();
-
     ls.Solve(rebuildnumeric ? b2 : b, &x);
 
-    // Stop solving time measurement
-    tack = rocalution_time();
-    std::cout << "Solving took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-
     // Verify solution
     x.ScaleAdd(-1.0, e);
     T nrm2 = x.Norm();

diff --git a/clients/samples/CMakeLists.txt b/clients/samples/CMakeLists.txt
@@ -83,6 +83,8 @@ if(SUPPORT_MPI)
   add_rocalution_example(cg-amg_mpi.cpp)
   add_rocalution_example(cg-rsamg_mpi.cpp)
   add_rocalution_example(cg_mpi.cpp)
+  add_rocalution_example(cg-saamg_mpi.cpp)
+  add_rocalution_example(cg-uaamg_mpi.cpp)
   add_rocalution_example(fcg_mpi.cpp)
   add_rocalution_example(fgmres_mpi.cpp)
   add_rocalution_example(global-io_mpi.cpp)

diff --git a/clients/samples/cg-saamg_mpi.cpp b/clients/samples/cg-saamg_mpi.cpp
@@ -0,0 +1,177 @@
+/* ************************************************************************
+ * Copyright (C) 2023 Advanced Micro Devices, Inc. All rights Reserved.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ *
+ * ************************************************************************ */
+
+#include "common.hpp"
+#include "utility.hpp"
+
+#include <iostream>
+#include <mpi.h>
+#include <rocalution/rocalution.hpp>
+
+using namespace rocalution;
+
+int main(int argc, char* argv[])
+{
+    // Initialize MPI
+    MPI_Init(&argc, &argv);
+    MPI_Comm comm = MPI_COMM_WORLD;
+
+    int rank;
+    int num_procs;
+
+    MPI_Comm_rank(comm, &rank);
+    MPI_Comm_size(comm, &num_procs);
+
+    if(argc < 2)
+    {
+        std::cerr << argv[0] << " <global_matrix>" << std::endl;
+        return -1;
+    }
+
+    // Disable OpenMP thread affinity
+    set_omp_affinity_rocalution(false);
+
+    // Initialize platform with rank and # of accelerator devices in the node
+    init_rocalution(rank, 8);
+
+    // Disable OpenMP
+    set_omp_threads_rocalution(1);
+
+    // Print rocALUTION info
+    info_rocalution();
+
+    // Load undistributed matrix
+    LocalMatrix<double> lmat;
+    lmat.ReadFileMTX(argv[1]);
+
+    // Global structures
+    ParallelManager      manager;
+    GlobalMatrix<double> mat;
+
+    // Distribute matrix - lmat will be destroyed
+    distribute_matrix(&comm, &lmat, &mat, &manager);
+
+    // rocALUTION vectors
+    GlobalVector<double> rhs(manager);
+    GlobalVector<double> x(manager);
+    GlobalVector<double> e(manager);
+
+    // Move objects to accelerator
+    mat.MoveToAccelerator();
+    x.MoveToAccelerator();
+    rhs.MoveToAccelerator();
+    e.MoveToAccelerator();
+
+    // Start time measurement
+    double tick, tack, start, end;
+    start = rocalution_time();
+
+    // Allocate vectors
+    x.Allocate("x", mat.GetN());
+    rhs.Allocate("rhs", mat.GetM());
+    e.Allocate("e", mat.GetN());
+
+    // Initialize rhs such that A 1 = rhs
+    e.Ones();
+    mat.Apply(e, &rhs);
+
+    // Initial zero guess
+    x.Zeros();
+
+    // Start time measurement
+    tick = rocalution_time();
+
+    // Linear Solver
+    CG<GlobalMatrix<double>, GlobalVector<double>, double> ls;
+
+    // AMG Preconditioner
+    SAAMG<GlobalMatrix<double>, GlobalVector<double>, double> p;
+    p.SetCoarseningStrategy(CoarseningStrategy::PMIS);
+    p.SetLumpingStrategy(LumpingStrategy::AddWeakConnections);
+    p.SetCoarsestLevel(2);
+    p.SetCouplingStrength(0.001);
+
+    // Disable verbosity output of AMG preconditioner
+    p.Verbose(0);
+
+    // Set solver preconditioner
+    ls.SetPreconditioner(p);
+    // Set solver operator
+    ls.SetOperator(mat);
+
+    // Build solver
+    ls.Build();
+
+    // Compute 2 coarsest levels on the host
+    p.SetHostLevels(2);
+
+    // Stop time measurement
+    tack = rocalution_time();
+
+    if(rank == 0)
+    {
+        std::cout << "Building took: " << (tack - tick) / 1e6 << " sec" << std::endl;
+    }
+
+    // Print matrix info
+    mat.Info();
+
+    // Initialize solver tolerances
+    ls.Init(1e-8, 1e-8, 1e+8, 10000);
+
+    // Set verbosity output
+    ls.Verbose(2);
+
+    // Start time measurement
+    tick = rocalution_time();
+
+    // Solve A x = rhs
+    ls.Solve(rhs, &x);
+
+    // Stop time measurement
+    tack = rocalution_time();
+
+    if(rank == 0)
+    {
+        std::cout << "Solver took: " << (tack - tick) / 1e6 << " sec" << std::endl;
+    }
+
+    // Clear solver
+    ls.Clear();
+
+    // Compute error L2 norm
+    e.ScaleAdd(-1.0, x);
+    double error = e.Norm();
+
+    if(rank == 0)
+    {
+        std::cout << "||e - x||_2 = " << error << std::endl;
+    }
+
+    // Stop rocALUTION platform
+    stop_rocalution();
+
+    MPI_Finalize();
+
+    return 0;
+}