multi-node SA/UA/RSAMG (#212)
* PMIS MPI support (#100)

* mpi to hip backend

* MPI enablement for PMIS

* MPI RS example

* clang-format

* example

* fix

* global Ext+I interpolation (#112)

* unused function

* Added PM as argument for aggregation function

* added pm for each level to MG class

* Added PM as argument for aggregation function #2

* Added PM as argument for aggregation function #3

* RS Ext+I added to global matrix

* modified example to dump some data for validation - work in progress

* RS Ext+I function added to headers

* RS Ext+I HIP implementation

* RS Ext+I host implementation

* Improved PM

* global Ext+I kernel update

* some multinode improvements (#118)

* added some more useful guards to parallel manager

* added CopyFromHostData, CopyToHostData, ExclusiveSum and Sort functionality for vector class ; moved boundary information from vector to matrix ; added GetFormat() to GlobalMatrix class

* clang-format

* P and R should be OperatorType, not LocalMatrix

* clang-format

* check if PM is valid when required

* added extra function for triple matrix product for simplicity

* clang-format

* clang-format

* skip free when ptr is nullptr

* fix memory leak in BaseAMG class

* rocsparse_csrgeam added

* renamed csr ext+i kernel because it is generally usable

* allowing CSR zero matrices with row_offset != nullptr, as well as zero vectors with size == 0

* clang-format

* duplicated row column entries throw a warning

* copy_x2x() functions added for readability and simplicity

* search and replace memcpy with copy fct

* search and replace memcpy with copy fct #2 ; fix for random csr generator to not generate duplicated row col entries

* clang-format

* fixes

* those asserts are wrong

* major version bump

* OpenMP parallel loop threshold needs to be int64_t in order to work with larger structures

* Allow basic structures with 64bit entries, e.g. for global indices

* nnz should be 64bit ; also restructured RSAMG for readability

* vector size needs to be 64bit locally - also added inclusive and exclusive sum functionality

* 64bit sizes for stencils

* host vector implementation changes for 64bit sizes and in/exclusive sum ; host I/O changed to always write 64bit sizes

* host stencil 64bit changes

* max residual index changed to 64bit accordingly

* solvers adjusted for 64bit nnz

* int64_t to double conversion

* allocation size should always be 64bit ; also added copy_h2h() for simplicity

* long and long long communication support added

* cleaned up types ; IndexType2 was a stupid name anyway

* removed deprecations (major release); enabled global structure support in RSAMG

* major changes to PM; added guards for transfers; removed deprecations; fixed int overflows; functionality to generate a PM from global ghost column ids, and a parent PM

* matrix conversions 64bit nnz support with guards

* host matrix I/O changed to always write 64bit sizes ; backward compatible

* host matrix implementations changed to 64bit nnz

* RSAMG restructured - global communication should not happen in local implementations ; switched to 64bit sizes

* host CSR matrix implementation

* hip implementation ; added copy_d2h/h2d/d2d for simplicity, with async flag

* adjusted unit tests to remove deprecated functions

* RSAMG MPI example updated

* fixed sanity assert

* doc update

* example should work with only 1 process

* global routines should work with single process

* global transpose operator

* using copy_h2h()

* _rocalution_sync should force a global barrier, too

* improved asynchronous apply / comm / halo apply

* accelerator must be available for pinned alloc/free

* fixing few compiler warnings

* readability

* removed the flood of printf on multi gpu systems

* adjusted openmp nested (deprecation) to v5.0

* weak scaling examples

* distributed laplacian generator

* updated rsamg example

* updated rsamg mpi example

* should use OperatorType, nothing else

* fixed RSDirectInterpolation(); fixed const PM issue

* updated unit tests

* types.hpp generated by cmake ; CSR(64/32) added on host ; moved RSPMIS communication into global matrix class

* removed old types.hpp

* initial implementation for unordered set and map on hip backend

* outsourced RSAMG to improve compilation performance; added async communication for multinode; moved multinode rspmis into globalmatrix; outsourced atomics

* clang-format

* fix for streams when not building for mpi

* SA amg merge fix

* fixed missing shared memory size

* clang-format

* clang-format

* typo

* clang format

* add blockdim to UAAMG benchmark

* adjusting unit tests for removed deprecated functions

* clang format

* test fix

* clang-format

* std::sort required algorithm header

* fixing merge error

* merge fix #2

* header cleaned up

* header cleaned up #2

* fix issue with HIP not being found

* free_pinned() does nothing on nullptr

* global triple matrix product

* proper error message when coarsening fails

* fixed a bug in global triplematrixproduct

* fixed a typo

* fixed compilation issue when HIP=off

* fixes COO and CSR conversions on both host and device, and ELL on host only (#211)

* empty matrix conversion fix

* host fallback fix for rsamg and triplematprod

* Fix documentation failures (#214)

Co-authored-by: jsandham <james.sandham@amd.com>

* Add Smoothed Aggregation to amgmpi branch (#213)

* Adding global aggregation to SAAMG (#166)

Co-authored-by: jsandham <james.sandham@amd.com>

* Add MPI support for global prolongation to SAAMG (#171)

Co-authored-by: jsandham <james.sandham@amd.com>

* Add MPI support for SAAMG global transpose (#172)

* Add MPI support for SAAMG global transpose

* Fix failures in greedy aggregation caused by unfilled aggregate_root_nodes array

---------

Co-authored-by: jsandham <james.sandham@amd.com>

* Add MPI unsmoothed aggregation (#174)

Co-authored-by: jsandham <james.sandham@amd.com>

* Adding debug printing to test triple product

* Adding debug print statements for testing

* Adding more debug printing

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Fix floating point fault caused by division by zero

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Testing

* Fix failures in local matrix when max_nnz_per_row is too high

* Testing

* Fix bug where we were not using a large enough hash table size

* Fix discrepancy in host and hip assert in ExtractSubMatrix

* Fixing hangs in multinode hip backend

* Fix RSAMG documentation warnings

* Testing MPI uaamg

* Fix testing_local_matrix failure

* Remove comments and temporary testing code

* PR fixes

* PR fixes

* PR fixes

* Clang formatting

---------

Co-authored-by: jsandham <james.sandham@amd.com>

* removed unused variables

* Add back functions that cannot be removed until next major release (#216)

Co-authored-by: jsandham <james.sandham@amd.com>

* fix for very large sizes where local ext matrix exceeds int32

* Remove print statements from saamg testing file

---------

Co-authored-by: James Sandham <33790278+jsandham@users.noreply.github.com>
Co-authored-by: jsandham <james.sandham@amd.com>
3 people committed Nov 21, 2023
1 parent 045b889 commit 5a91521
Showing 62 changed files with 14,167 additions and 1,423 deletions.
2 changes: 1 addition & 1 deletion .githooks/pre-commit
@@ -35,7 +35,7 @@ fi
for file in $files; do
if [[ -e $file ]]; then
/usr/bin/perl -pi -e 'INIT { exit 1 if !-f $ARGV[0] || -B $ARGV[0]; $year = (localtime)[5] + 1900 }
-s/^([*\/#[:space:]]*)Copyright\s+(?:\(C\)\s*)?(\d+)(?:\s*-\s*\d+)?/qq($1Copyright (c) $2@{[$year != $2 ? "-$year" : ""]})/ie
+s/^([*\/#[:space:]]*)Copyright\s+(?:\(C\)\s*)?(\d+)(?:\s*-\s*\d+)?/qq($1Copyright (C) $2@{[$year != $2 ? "-$year" : ""]})/ie
if $. < 10' "$file" && git add -u "$file"
fi
done
3 changes: 1 addition & 2 deletions CMakeLists.txt
@@ -1,5 +1,5 @@
# ########################################################################
-# Copyright (c) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
+# Copyright (C) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
@@ -37,7 +37,6 @@ list(APPEND CMAKE_PREFIX_PATH ${ROCM_PATH}/llvm ${ROCM_PATH})
# CMake modules
list(APPEND CMAKE_MODULE_PATH
${CMAKE_CURRENT_SOURCE_DIR}/cmake
${ROCM_PATH}/lib/cmake/hip
${ROCM_PATH}/hip/cmake)

# Set a default build type if none was specified
32 changes: 23 additions & 9 deletions clients/include/testing_local_matrix.hpp
@@ -41,10 +41,12 @@ void testing_local_matrix_bad_args(void)
set_device_rocalution(device);
init_rocalution();

-LocalMatrix<T> mat1;
-LocalMatrix<T> mat2;
-LocalVector<T> vec1;
-LocalVector<int> int1;
+LocalMatrix<T> mat1;
+LocalMatrix<T> mat2;
+LocalVector<T> vec1;
+LocalVector<bool> bool1;
+LocalVector<int> int1;
+LocalVector<int64_t> int641;

// null pointers
int* null_int = nullptr;
@@ -193,16 +195,28 @@ void testing_local_matrix_bad_args(void)
}

// AMG
+{
+int val;
+LocalVector<bool>* bool_null_vec = nullptr;
+LocalVector<int64_t>* int64_null_vec = nullptr;
+ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, bool_null_vec, &int641, &int641),
+".*Assertion.*connections != (NULL|__null)*");
+ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, &bool1, int64_null_vec, &int641),
+".*Assertion.*aggregates != (NULL|__null)*");
+ASSERT_DEATH(mat1.AMGGreedyAggregate(0.1, &bool1, &int641, int64_null_vec),
+".*Assertion.*aggregate_root_nodes != (NULL|__null)*");
+
+LocalMatrix<T>* null_mat = nullptr;
+ASSERT_DEATH(mat1.AMGSmoothedAggregation(0.1, bool1, int641, int641, null_mat),
+".*Assertion.*prolong != (NULL|__null)*");
+}
+
{
int val;
LocalVector<int>* null_vec = nullptr;
LocalMatrix<T>* null_mat = nullptr;
-ASSERT_DEATH(mat1.AMGConnect(0.1, null_vec), ".*Assertion.*connections != (NULL|__null)*");
-ASSERT_DEATH(mat1.AMGAggregate(int1, null_vec),
-".*Assertion.*aggregates != (NULL|__null)*");
-ASSERT_DEATH(mat1.AMGSmoothedAggregation(0.1, int1, int1, null_mat),
+ASSERT_DEATH(mat1.AMGUnsmoothedAggregation(int641, int641, null_mat),
".*Assertion.*prolong != (NULL|__null)*");
-ASSERT_DEATH(mat1.AMGAggregation(int1, null_mat), ".*Assertion.*prolong != (NULL|__null)*");
ASSERT_DEATH(mat1.InitialPairwiseAggregation(0.1, val, null_vec, val, &null_int, val, 0),
".*Assertion.*G != (NULL|__null)*");
ASSERT_DEATH(mat1.InitialPairwiseAggregation(0.1, val, &int1, val, &vint, val, 0),
27 changes: 1 addition & 26 deletions clients/include/testing_saamg.hpp
@@ -1,5 +1,5 @@
/* ************************************************************************
-* Copyright (C) 2018-2022 Advanced Micro Devices, Inc. All rights Reserved.
+* Copyright (C) 2018-2023 Advanced Micro Devices, Inc. All rights Reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
@@ -125,10 +125,6 @@ bool testing_saamg(Arguments argus)
// Solver
FCG<LocalMatrix<T>, LocalVector<T>, T> ls;

-// Start time measurement
-double tick = rocalution_time();
-double tack = rocalution_time();

// AMG
SAAMG<LocalMatrix<T>, LocalVector<T>, T> p;

@@ -157,12 +153,6 @@ bool testing_saamg(Arguments argus)
p.SetInterpRelax(2.0 / 3.0);
p.BuildHierarchy();

-// Stop build hierarchy time measurement
-tack = rocalution_time();
-std::cout << "Build Hierarchy took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-// Start smoother time measurement
-tick = rocalution_time();

// Get number of hierarchy levels
int levels = p.GetNumLevels();

@@ -206,11 +196,6 @@ bool testing_saamg(Arguments argus)

ls.Init(1e-8, 0.0, 1e+8, 10000);

-// Stop build smoother time measurement
-tack = rocalution_time();
-std::cout << "Smoother build took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-// Start build time measurement
-tick = rocalution_time();
ls.Build();

if(rebuildnumeric)
@@ -227,18 +212,8 @@ bool testing_saamg(Arguments argus)
// Matrix format
A.ConvertTo(format, format == BCSR ? argus.blockdim : 1);

-// Stop building time measurement
-tack = rocalution_time();
-std::cout << "Build took: " << (tack - tick) / 1e6 << " sec" << std::endl;
-// Start solving time measurement
-tick = rocalution_time();

ls.Solve(rebuildnumeric ? b2 : b, &x);

-// Stop solving time measurement
-tack = rocalution_time();
-std::cout << "Solving took: " << (tack - tick) / 1e6 << " sec" << std::endl;

// Verify solution
x.ScaleAdd(-1.0, e);
T nrm2 = x.Norm();
2 changes: 2 additions & 0 deletions clients/samples/CMakeLists.txt
@@ -83,6 +83,8 @@ if(SUPPORT_MPI)
add_rocalution_example(cg-amg_mpi.cpp)
add_rocalution_example(cg-rsamg_mpi.cpp)
add_rocalution_example(cg_mpi.cpp)
+add_rocalution_example(cg-saamg_mpi.cpp)
+add_rocalution_example(cg-uaamg_mpi.cpp)
add_rocalution_example(fcg_mpi.cpp)
add_rocalution_example(fgmres_mpi.cpp)
add_rocalution_example(global-io_mpi.cpp)
177 changes: 177 additions & 0 deletions clients/samples/cg-saamg_mpi.cpp
@@ -0,0 +1,177 @@
/* ************************************************************************
* Copyright (C) 2023 Advanced Micro Devices, Inc. All rights Reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*
* ************************************************************************ */

#include "common.hpp"
#include "utility.hpp"

#include <iostream>
#include <mpi.h>
#include <rocalution/rocalution.hpp>

using namespace rocalution;

int main(int argc, char* argv[])
{
// Initialize MPI
MPI_Init(&argc, &argv);
MPI_Comm comm = MPI_COMM_WORLD;

int rank;
int num_procs;

MPI_Comm_rank(comm, &rank);
MPI_Comm_size(comm, &num_procs);

if(argc < 2)
{
std::cerr << argv[0] << " <global_matrix>" << std::endl;
return -1;
}

// Disable OpenMP thread affinity
set_omp_affinity_rocalution(false);

// Initialize platform with rank and # of accelerator devices in the node
init_rocalution(rank, 8);

// Disable OpenMP
set_omp_threads_rocalution(1);

// Print rocALUTION info
info_rocalution();

// Load undistributed matrix
LocalMatrix<double> lmat;
lmat.ReadFileMTX(argv[1]);

// Global structures
ParallelManager manager;
GlobalMatrix<double> mat;

// Distribute matrix - lmat will be destroyed
distribute_matrix(&comm, &lmat, &mat, &manager);

// rocALUTION vectors
GlobalVector<double> rhs(manager);
GlobalVector<double> x(manager);
GlobalVector<double> e(manager);

// Move objects to accelerator
mat.MoveToAccelerator();
x.MoveToAccelerator();
rhs.MoveToAccelerator();
e.MoveToAccelerator();

// Start time measurement
double tick, tack, start, end;
start = rocalution_time();

// Allocate vectors
x.Allocate("x", mat.GetN());
rhs.Allocate("rhs", mat.GetM());
e.Allocate("e", mat.GetN());

// Initialize rhs such that A 1 = rhs
e.Ones();
mat.Apply(e, &rhs);

// Initial zero guess
x.Zeros();

// Start time measurement
tick = rocalution_time();

// Linear Solver
CG<GlobalMatrix<double>, GlobalVector<double>, double> ls;

// AMG Preconditioner
SAAMG<GlobalMatrix<double>, GlobalVector<double>, double> p;
p.SetCoarseningStrategy(CoarseningStrategy::PMIS);
p.SetLumpingStrategy(LumpingStrategy::AddWeakConnections);
p.SetCoarsestLevel(2);
p.SetCouplingStrength(0.001);

// Disable verbosity output of AMG preconditioner
p.Verbose(0);

// Set solver preconditioner
ls.SetPreconditioner(p);
// Set solver operator
ls.SetOperator(mat);

// Build solver
ls.Build();

// Compute 2 coarsest levels on the host
p.SetHostLevels(2);

// Stop time measurement
tack = rocalution_time();

if(rank == 0)
{
std::cout << "Building took: " << (tack - tick) / 1e6 << " sec" << std::endl;
}

// Print matrix info
mat.Info();

// Initialize solver tolerances
ls.Init(1e-8, 1e-8, 1e+8, 10000);

// Set verbosity output
ls.Verbose(2);

// Start time measurement
tick = rocalution_time();

// Solve A x = rhs
ls.Solve(rhs, &x);

// Stop time measurement
tack = rocalution_time();

if(rank == 0)
{
std::cout << "Solver took: " << (tack - tick) / 1e6 << " sec" << std::endl;
}

// Clear solver
ls.Clear();

// Compute error L2 norm
e.ScaleAdd(-1.0, x);
double error = e.Norm();

if(rank == 0)
{
std::cout << "||e - x||_2 = " << error << std::endl;
}

// Stop rocALUTION platform
stop_rocalution();

MPI_Finalize();

return 0;
}
