
【Hackathon 4 No.9】Add pca_lowrank API to Paddle #53743

Merged
merged 26 commits into from Jun 13, 2023

Conversation

Patrick-Star125
Contributor

@Patrick-Star125 Patrick-Star125 commented May 12, 2023

PR types

New features

PR changes

APIs

Description

Add pca_lowrank API to Paddle

Rfc PR: PaddlePaddle/community#474

To do

@paddle-bot

paddle-bot bot commented May 12, 2023

Your PR has been submitted. Thanks for your contribution!
Please wait for the CI results first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added contributor External developers status: proposed labels May 12, 2023
@paddle-bot

paddle-bot bot commented May 12, 2023

❌ The PR is not created using PR's template. You can refer to this Demo.
Please use PR's template, it helps save our maintainers' time so that more developers get helped.

@Patrick-Star125
Contributor Author

Patrick-Star125 commented May 14, 2023

My usage of Paddle's sparse operators seems to be incorrect: for the current sparse::matmul failure, the error reported by the online CI differs from the one I get locally on Linux.

The online error is:

NotFoundError: The kernel `matmul_coo_dense` is not registered.

The local error is:

On entry to cusparseCreateCoo(): dimension mismatch, nnz (6) > matrix size (4)

The local Python-side error is more detailed:

 ** On entry to cusparseCreateCoo() dimension mismatch: nnz > rows * cols
 ** On entry to cusparseSpMM_bufferSize() parameter number 5 (matA) had an illegal value: already destroyed
 ** On entry to cusparseSpMM() parameter number 5 (matA) had an illegal value: already destroyed
 ** On entry to cusparseDestroySpMat() parameter number 1 (spMatDescr) had an illegal value: already destroyed

I'd like to know whether the problem lies in how the sparse matmul operator is used or in how the sparse tensor is created. What does nnz > rows * cols mean?

@paddle-ci-bot

paddle-ci-bot bot commented May 21, 2023

Sorry to inform you that 485662d's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@zkh2016
Contributor

zkh2016 commented May 23, 2023

> What does nnz > rows * cols mean?

It means the matrix has more non-zero elements than rows * cols, where rows is the number of rows and cols is the number of columns.

How about adding some logging before and after creating the sparse tensor and calling sparse matmul, to see where the anomaly first appears?
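For reference, the invariant behind that error can be sketched in plain Python (a paddle-free sketch with a hypothetical helper name; the real check happens inside cuSPARSE):

```python
def check_coo(indices, values, shape):
    """Sanity-check a 2-D COO sparse tensor description.

    indices: [row_indices, col_indices], two equal-length lists
    values:  one non-zero value per index pair
    shape:   (rows, cols)
    """
    rows, cols = shape
    nnz = len(values)
    # cusparseCreateCoo() rejects nnz > rows * cols: a matrix cannot
    # hold more non-zero entries than it has entries in total.
    assert nnz <= rows * cols, f"nnz ({nnz}) > matrix size ({rows * cols})"
    assert len(indices[0]) == len(indices[1]) == nnz
    assert all(0 <= r < rows for r in indices[0])
    assert all(0 <= c < cols for c in indices[1])
    return True

# The original 17x4 tensor with nnz=6 passes; a shape whose total size
# is smaller than nnz would trip the first assertion.
check_coo([[0, 1, 2, 3, 4, 5], [1, 0, 0, 3, 1, 2]],
          [-0.354, 0.341, 0.014, -0.049, 0.0001, -0.0002], (17, 4))
```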

@Patrick-Star125
Contributor Author

Patrick-Star125 commented May 29, 2023

The root cause is that the current paddle.sparse.sum API can, in some cases, produce more non-zero elements than the tensor has in total. I'm not sure whether this is expected. Here is the reproduction:

# Original data (Paddle)
Tensor(shape=[17, 4], dtype=paddle.float64, place=Place(gpu:0), stop_gradient=True, 
       indices=[[0, 1, 2, 3, 4, 5],
                [1, 0, 0, 3, 1, 2]], 
       values=[-0.35408317,  0.34116612,  0.01369466, -0.04936034,  0.00005647,
               -0.00023232])
# Original data (PyTorch)
tensor(indices=tensor([[0, 1, 2, 3, 4, 5],
                       [1, 0, 0, 3, 1, 2]]),
       values=tensor([-3.5408e-01,  3.4117e-01,  1.3695e-02, -4.9360e-02,
                       5.6474e-05, -2.3232e-04]),
       size=(17, 4), nnz=6, dtype=torch.float64, layout=torch.sparse_coo)
paddle.sparse.sum(x, axis=-2) # Paddle result
Tensor(shape=[4], dtype=paddle.float64, place=Place(gpu:0), stop_gradient=True, 
       indices=[[1, 0, 0, 3, 1, 2]], 
       values=[-0.35402670,  0.35486078,  0.35486078, -0.04936034, -0.35402670,
               -0.00023232])

torch.sparse.sum(s_t_x, dim=-2) # PyTorch result
tensor(indices=tensor([[0, 1, 2, 3]]),
       values=tensor([ 3.5486e-01, -3.5403e-01, -2.3232e-04, -4.9360e-02]),
       size=(4,), nnz=4, dtype=torch.float64, layout=torch.sparse_coo)

As a result, feeding sum's output into sparse.matmul triggers the nnz > rows * cols warning. Printing further intermediate values shows that

paddle.sparse.matmul(C_t, ones_m1_t) and paddle.matmul(C_t.to_dense(), ones_m1_t) also give inconsistent results.
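For comparison, the coalescing behavior PyTorch shows can be sketched in plain Python (a reference sketch of the expected semantics, not Paddle's implementation; names are illustrative):

```python
from collections import defaultdict

def coo_sum_axis0(indices, values):
    """Reference semantics for summing a 2-D COO tensor over its row axis
    (axis=-2): accumulate values per column and coalesce duplicate indices,
    so the result's nnz can never exceed the number of columns."""
    acc = defaultdict(float)
    for col, val in zip(indices[1], values):
        acc[col] += val
    cols = sorted(acc)
    return cols, [acc[c] for c in cols]

# Data from the reproduction above (shape (17, 4), nnz=6):
idx = [[0, 1, 2, 3, 4, 5], [1, 0, 0, 3, 1, 2]]
vals = [-0.35408317, 0.34116612, 0.01369466, -0.04936034, 0.00005647, -0.00023232]
cols, sums = coo_sum_axis0(idx, vals)
# Four coalesced entries for columns 0..3, matching the PyTorch result above.
```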

@zkh2016
Contributor

zkh2016 commented May 29, 2023

> The root cause is that the current paddle.sparse.sum API can, in some cases, produce more non-zero elements than the tensor has in total. […]

@zrr1999 This looks like a problem in sparse.sum; can you track it down?

@zrr1999
Member

zrr1999 commented May 29, 2023

> @zrr1999 This looks like a problem in sparse.sum; can you track it down?

OK, I'll look into it.

@Patrick-Star125
Contributor Author

Patrick-Star125 commented May 30, 2023

After working around paddle.sparse.matmul, the unit tests pass. Could you please review all the points in the code that need changes?

@luotao1 luotao1 added the API label May 31, 2023
@zhengqiwen1997
Contributor

Please fix the PR-CI-Codestyle-Check pipeline; everything else looks fine to me.

@luotao1
Contributor

luotao1 commented May 31, 2023

You also need to add unit tests to pass the coverage pipeline.

@Patrick-Star125
Contributor Author

Should a CUDA 11.0 restriction be added to the sparse.pca_lowrank code?

@zhengqiwen1997
Contributor

> Should a CUDA 11.0 restriction be added to the sparse.pca_lowrank code?

No need. Please re-run the Windows pipeline.

positive number. The data type of x should be float32 or float64.
q (int, optional): a slightly overestimated rank of :math:`X`.
Default value is :math:`q=min(6,N,M)`.
center (bool, optional): if True, center the input tensor, otherwise,
Contributor

> if True, center the input tensor, otherwise, Default value is True.

Isn't something missing after "otherwise"? It doesn't read smoothly.

Contributor Author

Done

@zhengqiwen1997
Contributor

The static CI failed. It looks like the sparse API's doc example ran into return _C_ops.sparse_matmul(x, y), which requires CUDA 11.x. It seems a CUDA 11.0 restriction needs to be added to the sparse.pca_lowrank code to pass the static CI.

@Patrick-Star125
Contributor Author

I added the CUDA 11 restriction to both the example code and the computation code. Which one is more appropriate to keep?

@zhengqiwen1997
Contributor

Keep the restriction in both places. I see the example code has the restriction, but the static CI still failed. That's because you hid the GPU with os.environ["CUDA_VISIBLE_DEVICES"] = "", yet this CI's CUDA version is greater than 11.0, so the wrong branch ran. You can remove the os restriction.

@Patrick-Star125
Contributor Author

The os restriction here is added automatically; it seems to be how the test runner forces GPU code to run in CPU mode. The source is in tools\sampcd_processor.py.

os.environ["CUDA_VISIBLE_DEVICES"] = "" makes the CI run the code on CPU even when a GPU is present, but paddle.version.cuda() still reports that CUDA exists, so the sparse.pca_lowrank call cannot be skipped this way.

If we instead used the value of os.environ["CUDA_VISIBLE_DEVICES"] to decide whether CUDA exists, users would hit errors on their side (I tested this myself), and the example code would become unusable.

My idea is simply to explain it with a comment, as shown below; the CI passes and users can still understand it.

    print("sparse.pca_lowrank API only support CUDA 11.x")
    U, S, V = None, None, None
    # U, S, V = pca_lowrank(sparse_x)
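The guard being discussed ultimately reduces to comparing the version string returned by paddle.version.cuda(); a minimal sketch of that comparison (the helper name and the fallback markers are assumptions, not Paddle API):

```python
def cuda_at_least(version, major, minor=0):
    """Return True if a CUDA version string such as '11.2' meets the
    given minimum. Missing or CPU-only markers are treated as 'no CUDA'."""
    if not version or version in ("False", "0.0"):
        return False
    parts = version.split(".")
    got = (int(parts[0]), int(parts[1]) if len(parts) > 1 else 0)
    return got >= (major, minor)

# Intended usage in a doc example (hypothetical):
#   if cuda_at_least(paddle.version.cuda(), 11):
#       U, S, V = pca_lowrank(sparse_x)
```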

@zhengqiwen1997
Contributor

OK. @sunzhongkai588, could you take a look at whether writing the doc example this way is acceptable?

@sunzhongkai588
Contributor

> My idea is simply to explain it with a comment, as shown below; the CI passes and users can still understand it.
>
>     print("sparse.pca_lowrank API only support CUDA 11.x")
>     U, S, V = None, None, None
>     # U, S, V = pca_lowrank(sparse_x)

@Patrick-Star125 For the commented line, please briefly note in what situation it is used.
Also, the CI still doesn't seem to have passed.

@Patrick-Star125
Contributor Author

I added a one-line explanation, and the static CI passed.

Contributor

@sunzhongkai588 sunzhongkai588 left a comment

LGTM for docs

sunzhongkai588
sunzhongkai588 previously approved these changes Jun 7, 2023

Contributor

@zhengqiwen1997 zhengqiwen1997 left a comment

The coverage CI can be exempted, because the sparse functionality requires CUDA 11.x.

Comment on lines +60 to +62
x = paddle.sparse.sparse_coo_tensor(
indices_tensor, values, (rows, columns)
)
Contributor

We also need to test sparse_csr_tensor.

Contributor Author

This API doesn't support sparse_csr_tensor, because sparse.sum with axis=-2 has not been implemented:

(Unimplemented) axis of SumCsrKernel only support None or -1 now.More number will be supported in the future.

Contributor

Please add a TODO to remind us to complete the test once sparse.sum with axis=-2 is implemented.

Contributor Author

Done

Comment on lines 999 to 1000
if x.is_sparse():
return paddle.sparse.matmul(x, B)
Contributor

This API requires a sparse tensor as input (L1084), so this judgment and encapsulation are no longer needed?

Contributor Author

Some parts of the code use paddle.matmul, such as matmul(M, Q_c) at L1057.

Contributor

If matmul(M, Q_c) uses paddle.matmul, then call it directly; I think that is easier to understand.

Contributor Author

Done

Comment on lines +1012 to +1013
if x.is_sparse():
return paddle.sparse.transpose(x, perm)
Contributor

This API requires a sparse tensor as input (L1084), so this judgment is no longer needed?

Contributor Author

Some parts of the code use paddle.matmul, such as at L1127.

Contributor

Considering that it saves the code for computing perm, this can be kept.

@jeff41404
Contributor

The RFC needs to be updated to match the code. For example, the API in the RFC is paddle.pca_lowrank(A: Tensor, q: Optional[int] = None, center: bool = True, niter: int = 2), but the code is paddle.pca_lowrank(x, q, center, niter, name); also, paddle.sparse.pca_lowrank is not mentioned in the RFC.

@Patrick-Star125
Contributor Author

The RFC modification PR has been submitted: PaddlePaddle/community#555

Contributor

@jeff41404 jeff41404 left a comment

LGTM

@luotao1 luotao1 merged commit 4ebb476 into PaddlePaddle:develop Jun 13, 2023
25 checks passed
@Patrick-Star125 Patrick-Star125 deleted the pca branch June 18, 2023 07:36
8 participants