errors in second call to rocsparse_sps{v,m} #332

Open
jakub-homola opened this issue Jun 17, 2023 · 2 comments
jakub-homola commented Jun 17, 2023

Hello,

I have the Cholesky factor of a matrix, $A=U^\top U$. Now I want to solve the system $Az=x$, which is equivalent to first solving $U^\top y = x$ and then solving $Uz=y$. Implementing this in rocSPARSE, I get errors in the compute stage of the second system (both when the right-hand side is a vector and when it is a matrix, although the errors differ).
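Spelled out, with the intermediate vector $y := Uz$:

$$Az = U^\top U z = x \quad\Longrightarrow\quad U^\top y = x \ \text{(solve for } y\text{)}, \qquad Uz = y \ \text{(solve for } z\text{)}.$$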

What is the expected behavior

The attached program should not fail.

What actually happens

In the compute stage of the second system, the program segfaults in the rocsparse_spsm case; in the rocsparse_spsv case, the function returns rocsparse_status = 3 (probably rocsparse_status_invalid_pointer).

How to reproduce

See source.hip.cpp.txt (remove the .txt extension; for some reason .cpp files are not allowed here). In it, I allocate device memory for the sparse matrix $U$ and the vectors $x$, $y$, $z$. I fill the sparse matrix data so that it represents an identity matrix (for simplicity) and set the vectors to zero. I create the rocsparse_handle and the matrix descriptor and set the matrix attributes. Then, based on the command-line argument (V or M), I solve the two systems with either a vector or a matrix right-hand side (using rocsparse_spsv or rocsparse_spsm, respectively). The two solves (with $U^\top$ and with $U$) are identical in code except for the transpose parameter and the buffer used. For each solve, I query the buffer size, allocate the buffer, and run the preprocess and compute stages. At the end, I destroy and free everything.
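For reference, a minimal self-contained sketch of the spsv pattern described above (this is not the attached file; it assumes the ROCm 5.x generic API, double precision, and 32-bit indices):

```cpp
#include <hip/hip_runtime.h>
#include <rocsparse/rocsparse.h>
#include <cstdio>
#include <vector>

// One triangular solve through the generic spsv path:
// buffer size -> allocate -> preprocess -> compute.
static rocsparse_status solve(rocsparse_handle handle, rocsparse_operation op,
                              rocsparse_spmat_descr U, rocsparse_dnvec_descr rhs,
                              rocsparse_dnvec_descr sol, const double* alpha)
{
    size_t buffer_size = 0;
    void* buffer = nullptr;
    rocsparse_status st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
        rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
        rocsparse_spsv_stage_buffer_size, &buffer_size, nullptr);
    if(st != rocsparse_status_success) return st;
    hipMalloc(&buffer, buffer_size);
    st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
        rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
        rocsparse_spsv_stage_preprocess, &buffer_size, buffer);
    if(st == rocsparse_status_success)
        st = rocsparse_spsv(handle, op, alpha, U, rhs, sol,
            rocsparse_datatype_f64_r, rocsparse_spsv_alg_default,
            rocsparse_spsv_stage_compute, &buffer_size, buffer);
    hipFree(buffer);
    return st;
}

int main()
{
    const int64_t n = 4; // U = identity matrix in CSR, for simplicity
    std::vector<int> hptr(n + 1), hind(n);
    std::vector<double> hval(n, 1.0);
    for(int i = 0; i <= n; ++i) hptr[i] = i;
    for(int i = 0; i < n; ++i) hind[i] = i;

    int *dptr, *dind;
    double *dval, *dx, *dy, *dz;
    hipMalloc(&dptr, (n + 1) * sizeof(int));
    hipMalloc(&dind, n * sizeof(int));
    hipMalloc(&dval, n * sizeof(double));
    hipMalloc(&dx, n * sizeof(double));
    hipMalloc(&dy, n * sizeof(double));
    hipMalloc(&dz, n * sizeof(double));
    hipMemcpy(dptr, hptr.data(), (n + 1) * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dind, hind.data(), n * sizeof(int), hipMemcpyHostToDevice);
    hipMemcpy(dval, hval.data(), n * sizeof(double), hipMemcpyHostToDevice);
    hipMemset(dx, 0, n * sizeof(double));
    hipMemset(dy, 0, n * sizeof(double));
    hipMemset(dz, 0, n * sizeof(double));

    rocsparse_handle handle;
    rocsparse_create_handle(&handle);

    // single matrix descriptor, reused for both solves
    rocsparse_spmat_descr U;
    rocsparse_create_csr_descr(&U, n, n, n, dptr, dind, dval,
        rocsparse_indextype_i32, rocsparse_indextype_i32,
        rocsparse_index_base_zero, rocsparse_datatype_f64_r);
    rocsparse_fill_mode fill = rocsparse_fill_mode_upper;
    rocsparse_diag_type diag = rocsparse_diag_type_non_unit;
    rocsparse_spmat_set_attribute(U, rocsparse_spmat_fill_mode, &fill, sizeof(fill));
    rocsparse_spmat_set_attribute(U, rocsparse_spmat_diag_type, &diag, sizeof(diag));

    rocsparse_dnvec_descr vx, vy, vz;
    rocsparse_create_dnvec_descr(&vx, n, dx, rocsparse_datatype_f64_r);
    rocsparse_create_dnvec_descr(&vy, n, dy, rocsparse_datatype_f64_r);
    rocsparse_create_dnvec_descr(&vz, n, dz, rocsparse_datatype_f64_r);

    const double alpha = 1.0;
    // first system:  U^T y = x  -- succeeds
    printf("solve 1: %d\n", (int)solve(handle, rocsparse_operation_transpose, U, vx, vy, &alpha));
    // second system: U z = y    -- same descriptor, different operation: fails
    printf("solve 2: %d\n", (int)solve(handle, rocsparse_operation_none, U, vy, vz, &alpha));

    // cleanup omitted for brevity
    return 0;
}
```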

If I use a separate matrix descriptor for the second system, the program works fine (a sketch of this workaround follows below). It also works fine if the transpose parameters of the two solves are the same, or if I run only the second solve (commenting out the first).
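Under the same assumptions as in the sketch above, the working variant looks roughly like this: two spmat descriptors aliasing the same device CSR arrays, one per operation.

```cpp
// Workaround sketch: one descriptor per operation, both pointing at the
// same CSR arrays (dptr/dind/dval as in the sketch above).
rocsparse_spmat_descr U_trans, U_notrans;
rocsparse_create_csr_descr(&U_trans, n, n, n, dptr, dind, dval,
    rocsparse_indextype_i32, rocsparse_indextype_i32,
    rocsparse_index_base_zero, rocsparse_datatype_f64_r);
rocsparse_create_csr_descr(&U_notrans, n, n, n, dptr, dind, dval,
    rocsparse_indextype_i32, rocsparse_indextype_i32,
    rocsparse_index_base_zero, rocsparse_datatype_f64_r);
// (set the fill mode and diag type attributes on both, as before)

solve(handle, rocsparse_operation_transpose, U_trans,   vx, vy, &alpha); // U^T y = x
solve(handle, rocsparse_operation_none,      U_notrans, vy, vz, &alpha); // U  z = y
```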

Compile using `hipcc -g -O2 --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x -lrocsparse`. Run with `./program.x V` for spsv and `./program.x M` for spsm.

Environment

I use the LUMI supercomputer: an accelerated compute node with an MI250X GPU (I use 1/8 of the node, i.e., a single GPU die), `module load LUMI/22.12 rocm/5.2.3`. I also tested with rocm/5.5.1 on an MI100, and the problems occur there too.

Thanks in advance for taking a look at this. From my side it looks like a bug in rocSPARSE. In case I missed an important detail or you need more info, please ask.

Jakub

jakub-homola commented Jun 18, 2023

Okay, I might have found the issue.

In the documentation of rocsparse_csrsv_analysis and rocsparse_csrsm_analysis (link, link), which rocsparse_spsv and rocsparse_spsm call internally (in fact, they call the same internal functions, rocsparse_csrs{v,m}_analysis_template), it is stated that "It is expected that this function will be executed only once for a given matrix and particular operation type".

So I should create a new matrix descriptor for a different operation with the same matrix. It is thus not really a matrix descriptor, but rather a descriptor of the matrix together with the operation performed on it.

Is that right? Is that the cause of my issues? The errors occur in the compute stage, but the docs only state that the analysis should not be called twice, so I am not sure.

If this is really the problem, then this restriction is not mentioned in the documentation of rocsparse_spsv and rocsparse_spsm.

Furthermore, I am now having a very similar problem with rocsparse_spmv: I perform the operations $y=A^\top x$ and then $z=Ay$, and the second spmv fails with rocsparse_status_invalid_value, most probably on this line. Again, the restriction is mentioned in the documentation of rocsparse_csrmv_analysis, but not in that of rocsparse_spmv.
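For completeness, a sketch of the failing spmv sequence. It assumes the ROCm 5.x rocsparse_spmv signature (no explicit stage parameter; a null temp_buffer queries the buffer size), a general CSR descriptor A created like U in the sketches above (without the triangular fill-mode attributes), and the dense vector descriptors vx, vy, vz from there:

```cpp
// Same pattern with spmv: y = A^T x, then z = A y, reusing one descriptor.
const double alpha = 1.0, beta = 0.0;
size_t bsize = 0;
void* buf = nullptr;

// y = A^T x -- buffer size query (null temp_buffer), then compute: works
rocsparse_spmv(handle, rocsparse_operation_transpose, &alpha, A, vx, &beta, vy,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, nullptr);
hipMalloc(&buf, bsize);
rocsparse_spmv(handle, rocsparse_operation_transpose, &alpha, A, vx, &beta, vy,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, buf);

// z = A y -- same descriptor, different operation:
// fails with rocsparse_status_invalid_value
rocsparse_spmv(handle, rocsparse_operation_none, &alpha, A, vy, &beta, vz,
               rocsparse_datatype_f64_r, rocsparse_spmv_alg_default, &bsize, nullptr);
```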

@jakub-homola
Copy link
Author

jakub-homola commented Jan 2, 2024

BTW, the fact that the rocsparse_spmat_descr also stores data relevant to the operation being performed with the matrix causes an incompatibility with cuSPARSE (at least in the sparse generic functions). In cuSPARSE, there is a separate descriptor describing the matrix properties (cusparseSpMatDescr_t) and a separate descriptor for the operation being performed (cusparseSpSMDescr_t); the latter has no counterpart in rocSPARSE. People coming from cuSPARSE, expecting hipSPARSE to work well on both NVIDIA and AMD, might be surprised that they cannot use the same sparse matrix descriptor for two operations (this is exactly what happened to me when I opened this issue).
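For comparison, a rough sketch of the cuSPARSE pattern (descriptor and buffer names hypothetical; handle/descriptor setup and buffer sizing elided): the cusparseSpMatDescr_t stays operation-agnostic, and each solve carries its own cusparseSpSMDescr_t.

```cpp
// cuSPARSE: the matrix descriptor matU is reusable because the per-operation
// analysis data lives in a separate cusparseSpSMDescr_t.
cusparseSpSMDescr_t spsm_trans, spsm_notrans;
cusparseSpSM_createDescr(&spsm_trans);
cusparseSpSM_createDescr(&spsm_notrans);

// U^T Y = X, with its own operation descriptor and buffer
cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matX, matY,
                      CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_trans, buf1);
cusparseSpSM_solve(handle, CUSPARSE_OPERATION_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matX, matY,
                   CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_trans);

// U Z = Y: same matU, but a different operation descriptor
cusparseSpSM_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                      CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matY, matZ,
                      CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_notrans, buf2);
cusparseSpSM_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   CUSPARSE_OPERATION_NON_TRANSPOSE, &alpha, matU, matY, matZ,
                   CUDA_R_64F, CUSPARSE_SPSM_ALG_DEFAULT, spsm_notrans);
```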
