errors in second call to rocsparse_sps{v,m} #332

Open
jakub-homola opened this issue Jun 17, 2023 · 2 comments
jakub-homola commented Jun 17, 2023

Hello,

I have the Cholesky factor of a matrix, $A=U^\top U$. Now I want to solve the system $Az=x$, which is equivalent to first solving $U^\top y = x$ and then solving $Uz=y$. Implementing this in rocSPARSE, I get errors in the compute stage of the second system (both when the right-hand side is a vector and when it is a matrix, although the errors differ).
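Spelled out, with the intermediate vector $y := Uz$:

$$Az = U^\top U z = x \quad\Longrightarrow\quad U^\top y = x \ \text{(solve for } y\text{)}, \qquad Uz = y \ \text{(solve for } z\text{)}.$$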

What is the expected behavior

The attached program should not fail.

What actually happens

In the compute stage of the second system, the program segfaults in the rocsparse_spsm case; in the rocsparse_spsv case, the function returns rocsparse_status = 3 (probably rocsparse_status_invalid_pointer).

How to reproduce

See source.hip.cpp.txt (remove the .txt extension; for some reason .cpp files are not allowed here). In it, I allocate device memory for the sparse matrix $U$ and the vectors $x$, $y$, $z$. I fill the sparse matrix data so that it represents an identity matrix (for simplicity) and set the vectors to zero. I create the rocsparse_handle and the matrix descriptor and set the matrix attributes. Then, based on the command-line argument (V or M), I solve the two systems with either a vector or a matrix right-hand side (using rocsparse_spsv or rocsparse_spsm, respectively). The two solves (with $U^\top$ and with $U$) are identical in code except for the transpose parameter and the buffer used. For each solve, I query the buffer size, allocate the buffer, and run the preprocess and compute stages. At the end, I destroy and free everything.
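For reference, a minimal self-contained sketch of the spsv pattern described above (this is not the attached file; it assumes the ROCm 5.x generic API, double precision, and 32-bit indices):

```cpp
#include <hip/hip_runtime.h>
#include <rocsparse/rocsparse.h>
#include <cstdio>
#include <vector>

// One triangular solve through the generic spsv path:
// buffer size -> allocate -> preprocess -> compute.
static rocsparse_status solve(rocsparse_handle handle, rocsparse_operation op,
                              rocsparse_spmat_descr U, rocsparse_dnvec_descr rhs,
                              rocsparse_dnvec_descr sol, const double* alpha)
{
    size_t buffer_size = 0;
    void* buffer = nullptr;
    rocsparse_status st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
        rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
        rocsparse_spsv_stage_buffer_size, &buffer_size, nullptr);
    if(st != rocsparse_status_success) return st;
    hipMalloc(&buffer, buffer_size);
    st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
        rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
        rocsparse_spsv_stage_preprocess, &buffer_size, buffer);
    if(st == rocsparse_status_success)
        st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
            rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
            rocsparse_spsv_stage_compute, &buffer_size, buffer);
    hipFree(buffer);
    return st;
}

int main()
{
    const int64_t n = 4; // U = identity matrix in CSR, for simplicity
    std::vector<int> hptr(n + 1), hind(n);
    std::vector<double> hval(n, 1.0);
    for(int i = 0; i <= n; ++i) hptr[i] = i;
    for(int i = 0; i < n; ++i) hind[i] = i;

    int *dptr, *dind;
    double *dval, *dx, *dy, *dz;
    hipMalloc(&dptr, (n + 1) * sizeof(int));
    hipMalloc(&dind, n * sizeof(int));
    hipMalloc(&dval, n * sizeof(double));
    hipMalloc(&dx, n * sizeof(double));
    hipMalloc(&dy, n * sizeof(double));
    hipMalloc(&dz, n * sizeof(double));
    hipMemcpy(dptr, hptr.data(), (n + 1) * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dind, hind.data(), n * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dval, hval.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemset(dx, 0, n * sizeof(double));
    hipMemset(dy, 0, n * sizeof(double));
    hipMemset(dz, 0, n * sizeof(double));

    rocsparse_handle handle;
    rocsparse_create_handle(&handle);

    // single matrix descriptor, reused for both solves
    rocsparse_spmat_descr U;
    rocsparse_create_csr_descr(&U, n, n, n, dptr, dind, dval,
        rocsparse_indextype_i32, rocsparse_indextype_i32,
        rocsparse_index_base_zero, rocsparse_datatype_f64_r);
    rocsparse_fill_mode fill = rocsparse_fill_mode_upper;
    rocsparse_diag_type diag = rocsparse_diag_type_non_unit;
    rocsparse_spmat_set_attribute(U, rocsparse_spmat_fill_mode, &fill, sizeof(fill));
    rocsparse_spmat_set_attribute(U, rocsparse_spmat_diag_type, &diag, sizeof(diag));

    rocsparse_dnvec_descr vx, vy, vz;
    rocsparse_create_dnvec_descr(&vx, n, dx, rocsparse_datatype_f64_r);
    rocsparse_create_dnvec_descr(&vy, n, dy, rocsparse_datatype_f64_r);
    rocsparse_create_dnvec_descr(&vz, n, dz, rocsparse_datatype_f64_r);

    const double alpha = 1.0;
    // first system:  U^T y = x  -- succeeds
    printf("solve 1: %d\n", (int)solve(handle, rocsparse_operation_transpose, U, vx, vy, &alpha));
    // second system: U z = y    -- same descriptor, different operation: fails
    printf("solve 2: %d\n", (int)solve(handle, rocsparse_operation_none, U, vy, vz, &alpha));

    // cleanup omitted for brevity
    return 0;
}
```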

If I use a separate matrix descriptor for the second system, the program works fine (a sketch of this workaround follows below). It also works fine if the transpose parameters of the two solves are the same, or if I run only the second solve (commenting out the first).
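Under the same assumptions as in the sketch above, the working variant looks roughly like this: two spmat descriptors aliasing the same device CSR arrays, one per operation.

```cpp
// Workaround sketch: one descriptor per operation, both pointing at the
// same CSR arrays (dptr/dind/dval as in the sketch above).
rocsparse_spmat_descr U_trans, U_notrans;
rocsparse_create_csr_descr(&U_trans, n, n, n, dptr, dind, dval,
    rocsparse_indextype_i32, rocsparse_indextype_i32,
    rocsparse_index_base_zero, rocsparse_datatype_f64_r);
rocsparse_create_csr_descr(&U_notrans, n, n, n, dptr, dind, dval,
    rocsparse_indextype_i32, rocsparse_indextype_i32,
    rocsparse_index_base_zero, rocsparse_datatype_f64_r);
// (set the fill mode and diag type attributes on both, as before)

solve(handle, rocsparse_operation_transpose, U_trans,   vx, vy, &alpha); // U^T y = x
solve(handle, rocsparse_operation_none,      U_notrans, vy, vz, &alpha); // U  z = y
```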

Compile using `hipcc -g -O2 --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x -lrocsparse`. Run with `./program.x V` for spsv and `./program.x M` for spsm.

Environment

I use the LUMI supercomputer: an accelerated compute node with an MI250X GPU (I use 1/8 of the node, i.e., a single GPU die), `module load LUMI/22.12 rocm/5.2.3`. I also tested with rocm/5.5.1 on an MI100, and the problems occur there too.

Thanks in advance for taking a look at this. From my side it looks like a bug in rocSPARSE. In case I missed an important detail or you need more info, please ask.

Jakub

jakub-homola commented Jun 18, 2023

Okay, I might have found the issue.

In the documentation of rocsparse_csrsv_analysis and rocsparse_csrsm_analysis (link, link), which rocsparse_spsv and rocsparse_spsm call internally (in fact, they call the same internal functions, rocsparse_csrs{v,m}_analysis_template), it is stated that "It is expected that this function will be executed only once for a given matrix and particular operation type".

So I should create a new matrix descriptor for a different operation with the same matrix. It is thus not really a matrix descriptor, but rather a descriptor of the matrix together with the operation performed on it.

Is that right? Is that the cause of my issues? The errors occur in the compute stage, but the docs only state that the analysis should not be called twice, so I am not sure.

If this is really the problem, then this restriction is not mentioned in the documentation of rocsparse_spsv and rocsparse_spsm.

Furthermore, I am now having a very similar problem with rocsparse_spmv: I perform the operations $y=A^\top x$ and then $z=Ay$, and the second spmv fails with rocsparse_status_invalid_value, most probably on this line. Again, the restriction is mentioned in the documentation of rocsparse_csrmv_analysis, but not in that of rocsparse_spmv.
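For completeness, a sketch of the failing spmv sequence. It assumes the ROCm 5.x rocsparse_spmv signature (no explicit stage parameter; a null temp_buffer queries the buffer size), a general CSR descriptor A created like U in the sketches above (without the triangular fill-mode attributes), and the dense vector descriptors vx, vy, vz from there:

```cpp
// Same pattern with spmv: y = A^T x, then z = A y, reusing one descriptor.
const double alpha = 1.0, beta = 0.0;
size_t bsize = 0;
void* buf = nullptr;

// y = A^T x -- buffer size query (null temp_buffer), then compute: works
rocsparse_spmv(handle, rocsparse_operation_transpose, &alpha, A, vx, &beta, vy,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, nullptr);
hipMalloc(&buf, bsize);
rocsparse_spmv(handle, rocsparse_operation_transpose, &alpha, A, vx, &beta, vy,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, buf);

// z = A y -- same descriptor, different operation:
// fails with rocsparse_status_invalid_value
rocsparse_spmv(handle, rocsparse_operation_none, &alpha, A, vy, &beta, vz,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, nullptr);
```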

@jakub-homola
Copy link
Author

jakub-homola commented Jan 2, 2024

BTW, the fact that the rocsparse_spmat_descr also stores data relevant to the operation being performed with the matrix causes an incompatibility with cuSPARSE (at least in the sparse generic functions). In cuSPARSE, there is a separate descriptor describing the matrix properties (cusparseSpMatDescr_t) and a separate descriptor for the operation being performed (cusparseSpSMDescr_t); the latter has no counterpart in rocSPARSE. People coming from cuSPARSE, expecting hipSPARSE to work well on both NVIDIA and AMD, might be surprised that they cannot use the same sparse matrix descriptor for two operations (this is exactly what happened to me when I opened this issue).
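For comparison, a rough sketch of the cuSPARSE pattern (descriptor and buffer names hypothetical; handle/descriptor setup and buffer sizing elided): the cusparseSpMatDescr_t stays operation-agnostic, and each solve carries its own cusparseSpSMDescr_t.

```cpp
// cuSPARSE: the matrix descriptor matU is reusable because the per-operation
// analysis data lives in a separate cusparseSpSMDescr_t.
cusparseSpSMDescr_t spsm_trans, spsm_notrans;
cusparseSpSM_createDescr(&spsm_trans);
cusparseSpSM_createDescr(&spsm_notrans);

// U^T Y = X, with its own operation descriptor and buffer
cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matX, matY,
                      CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_trans, buf1);
cusparseSpSM_solve(handle, CUSPARSE_OPERATION_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matX, matY,
                   CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_trans);

// U Z = Y: same matU, but a different operation descriptor
cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matY, matZ,
                      CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_notrans, buf2);
cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matY, matZ,
                   CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_notrans);
```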
