Data preprocessing errors in the SIGN algorithm #141

chrisxu2016 · 2021-08-13T08:13:30Z

1. setxor bug: for example, setxor(a=[1, 0], b=[0, 2]) returns [0, 1, 0, 2], []. Given how bond_graph_base is built, it only needs to take a[0] and b[1].

PaddleHelix/apps/drug_target_interaction/sign/dataset.py

Line 149 in e5578f7

bodyxor, link = setxor(body1, body2)
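2. The atoms returned here use each atom's dimension in the feature matrix, which is inconsistent with the atom_type used later. The preprocessed data you provide is fine (https://www.dropbox.com/sh/68vc7j5cvqo4p39/AAB_96TpzJWXw6N0zxHdsppEa).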

PaddleHelix/apps/drug_target_interaction/sign/preprocess_pdbbind.py

Line 280 in e5578f7

return lig_size, coords, feas, atoms

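Something I'm unsure about:
3. If edge a is [0, 1] and edge b is [1, 0], then edge c is [0, 0]. Taking dist_mat[0, 0] gives edge c a length of inf, so the computed angle is 180 degrees (encoded as 5). Following how the angles for the other edges are constructed, however, the angle should be 0 degrees (encoded as 0).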

PaddleHelix/apps/drug_target_interaction/sign/dataset.py

Line 152 in e5578f7

c = dist_mat[bodyxor[0], bodyxor[1]]
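The setxor case and the [0, 0] edge case above can be illustrated with a small sketch. It is not the repository's code: `open_ends` and `angle_deg` are hypothetical helpers, bonds are assumed to be directed (source, target) pairs with a[1] == b[0], and the cosine is assumed to be clipped to [-1, 1] before taking the arccosine.

```python
import math


def open_ends(a, b):
    """Hypothetical helper: a and b are directed bonds (src, dst), a[1] == b[0].

    Returns the two non-shared atoms [a[0], b[1]] and the shared atom.
    """
    if a[1] != b[0]:
        raise ValueError("bond a must end at the atom where bond b starts")
    return [a[0], b[1]], a[1]


def angle_deg(len_a, len_b, len_c):
    """Angle between sides a and b of a triangle, via the law of cosines."""
    cos_theta = (len_a ** 2 + len_b ** 2 - len_c ** 2) / (2 * len_a * len_b)
    # Assumption: clip to the valid arccos domain, so an infinite c maps to -1.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


# The example from the report: only a[0] and b[1] are needed.
ends, link = open_ends([1, 0], [0, 2])      # ends == [1, 2], link == 0

# A bond followed by its own reverse: both open ends are the same atom.
same_ends, _ = open_ends([0, 1], [1, 0])    # same_ends == [0, 0]

# If dist_mat[0, 0] is inf, the angle comes out as 180 degrees;
# with a true self-distance of 0 it would be 0 degrees.
angle_with_inf = angle_deg(1.5, 1.5, float("inf"))
angle_with_zero = angle_deg(1.5, 1.5, 0.0)
```

The two angles differ only because of the value stored on the diagonal of the distance matrix, which is why the self-adjacent case needs an explicit decision.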

chrisxu2016 · 2021-08-15T01:11:19Z

An update: after fixing these issues, my results are
RMSE: 1.238, MAE: 0.978, SD: 1.23, R: 0.824
Paper results: RMSE: 1.316(0.031), MAE: 1.027(0.025), SD: 1.312(0.035), R:0.797(0.012)

agave233 · 2021-08-16T09:08:10Z

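Thanks for pointing out these problems.

These are indeed bugs in the data processing code. We fixed them today and reran the experiments; in our machine environment the results differ little from those in the paper: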

RMSE: 1.311(0.021), MAE: 1.021(0.009), SD: 1.312(0.027), R:0.798(0.008)

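During feature extraction we tried different methods for different data formats, and the wrong function was included when we released the code. The preprocessed dataset we actually generated is unaffected.
Our code did in fact include a step that removes this case from the graph, so that a_01 and a_10 are not connected; this was also overlooked when we tidied up the code for release.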

bond_graph_base[range(num_bonds), [indices.index([x[1],x[0]]) for x in indices]] = 0

We will push code with the fixes above shortly.
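As a rough illustration of the masking line above, here is a toy example. The way the assignment matrices are filled in is an assumption (a bond is linked to every bond that starts at its target atom), and a dictionary lookup is swapped in for `indices.index`, which avoids a linear scan per bond. It also assumes every directed bond has its reverse in `indices`.

```python
import numpy as np

# Toy molecule 0 - 1 - 2, with each bond stored in both directions.
indices = [[0, 1], [1, 0], [1, 2], [2, 1]]
num_bonds = len(indices)
num_atoms = 3

# Assumed construction: bond i points at atom dst, atom src emits bond i.
assignment_b2a = np.zeros((num_bonds, num_atoms), dtype=np.int64)
assignment_a2b = np.zeros((num_atoms, num_bonds), dtype=np.int64)
for i, (src, dst) in enumerate(indices):
    assignment_b2a[i, dst] = 1
    assignment_a2b[src, i] = 1

# bond_graph_base[i, j] == 1 when bond i ends where bond j starts.
bond_graph_base = assignment_b2a @ assignment_a2b

# Map each directed bond to its row index, then look up every reverse bond.
bond_to_index = {(src, dst): i for i, (src, dst) in enumerate(indices)}
reverse = [bond_to_index[(dst, src)] for src, dst in indices]

# Disconnect each bond from its own reverse (a_01 from a_10, and so on).
bond_graph_base[np.arange(num_bonds), reverse] = 0
```

In this toy graph only the pairs (0→1, 1→2) and (2→1, 1→0) remain connected after masking.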

chrisxu2016 · 2021-08-17T07:51:46Z

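OK, I'll try removing the self-adjacent edges. Apart from reducing computation, does this have any other benefit?

PS. Two small suggestions that can speed up data preprocessing:

The bond2bond adjacency matrix can be computed on the GPU to speed up the product of the two matrices, for example with cupy: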

import cupy as cp

bond_graph_base = cp.matmul(cp.array(assignment_b2a, dtype='int8'), cp.array(assignment_a2b, dtype='int8')).get()

PaddleHelix/apps/drug_target_interaction/sign/dataset.py

Line 139 in e5578f7

bond_graph_base = assignment_b2a @ assignment_a2b

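The angles between pairs of edges can be computed in a vectorized way with numpy.

With these two changes, processing takes about 0.1 s per sample on average.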
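A minimal sketch of what the vectorized angle computation might look like. `pairwise_angles_deg` is a hypothetical name, not a function from the repository; it assumes bonds are directed (source, target) pairs, that a nonzero entry in bond_graph_base means the row bond ends where the column bond starts, and that the angle comes from the law of cosines on dist_mat.

```python
import numpy as np


def pairwise_angles_deg(bond_graph_base, indices, dist_mat):
    """Angles (degrees) for all connected bond pairs, computed in one pass."""
    indices = np.asarray(indices)
    bond_a, bond_b = np.nonzero(bond_graph_base)  # connected bond pairs
    end_a = indices[bond_a, 0]    # open end of bond a
    shared = indices[bond_a, 1]   # atom shared by both bonds
    end_b = indices[bond_b, 1]    # open end of bond b

    a = dist_mat[end_a, shared]
    b = dist_mat[shared, end_b]
    c = dist_mat[end_a, end_b]

    # Law of cosines, clipped so rounding error cannot leave arccos' domain.
    cos_theta = (a ** 2 + b ** 2 - c ** 2) / (2 * a * b)
    angles = np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))
    return bond_a, bond_b, angles


# Toy example: three atoms forming a right angle at atom 1.
coords = np.array([[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
dist_mat = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
indices = [[0, 1], [1, 2]]
bond_graph_base = np.array([[0, 1], [0, 0]])

bond_a, bond_b, angles = pairwise_angles_deg(bond_graph_base, indices, dist_mat)
```

Fancy indexing replaces the per-pair Python loop, so the cost is dominated by a few array operations over all connected pairs at once.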

agave233 · 2021-08-17T12:11:43Z

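This is similar to removing self-loops in the atom graph; the main goal is for the model to learn the spatial distribution of the neighbors around each target atom/bond. Alternatively, these "self-adjacent edges" could be put into a separate domain of their own. We plan to try different strategies later.

Many thanks for the series of suggestions 👍🏻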

chrisxu2016 changed the title ~~3 data preprocessing errors in the SIGN algorithm~~ Data preprocessing errors in the SIGN algorithm Aug 13, 2021

chrisxu2016 closed this as completed Aug 20, 2021
