Issue 68 edesolv #238

DarioMarzella · 2021-07-15T18:32:42Z

The desolvation energy (Edesolv) feature is now working in this branch (completes issue #68).
The organization of deeprank/features/Edesolv.py might not be ideal, though, so I would appreciate any comments on it.


coveralls · 2021-07-15T18:39:01Z

Pull Request Test Coverage Report for Build 1127779947

133 of 166 (80.12%) changed or added relevant lines in 2 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.2%) to 77.329%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
deeprank/features/Edesolv.py	131	164	79.88%

Totals
Change from base Build 741819231:	0.2%
Covered Lines:	1760
Relevant Lines:	2276


CunliangGeng · 2021-08-03T14:49:45Z

deeprank/features/Edesolv.py

+
+class Edesolv(FeatureClass):
+
+    # init the class


Please leave a comment about the equations and references used for calculating Edesolv.

Do you mean in the class docstring?

Improved in 6fed0fc. Please let me know if it's not satisfactory yet.
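For readers without access to the docstring, here is a minimal sketch of one common way a desolvation energy is written: a sum over atoms of a per-atom-type solvation parameter times the change in solvent-accessible surface area between the complex and the free chains. This form is an assumption and is not confirmed by this thread; the function name, the atom keys, and the parameter values are hypothetical placeholders, not the values used in Edesolv.py or HADDOCK.

```python
def desolvation_energy(sasa_complex, sasa_free, atom_types, solvation_params):
    """Sum param(type_i) * (SASA_complex_i - SASA_free_i) over all atoms.

    All four arguments are plain dicts; the first three are keyed by an
    atom identifier, the last one by atom type.
    """
    total = 0.0
    for atom_id, area_free in sasa_free.items():
        delta_area = sasa_complex[atom_id] - area_free
        total += solvation_params[atom_types[atom_id]] * delta_area
    return total


# Placeholder numbers for illustration only (not real parameters).
sasa_free = {"A:1:CA": 10.0, "A:1:N": 5.0}
sasa_complex = {"A:1:CA": 4.0, "A:1:N": 5.0}
atom_types = {"A:1:CA": "C", "A:1:N": "N"}
solvation_params = {"C": 0.5, "N": -0.25}

energy = desolvation_energy(sasa_complex, sasa_free, atom_types, solvation_params)
print(energy)
```

Only the carbon atom loses surface area in this toy input, so it is the only atom that contributes to the sum.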


CunliangGeng · 2021-08-03T15:16:21Z

deeprank/features/Edesolv.py

+        chainA = chains[0]
+        chainB = chains[1]
+
+        # Make free_structure fake object and translate the chains away from each other


Split the chains into independent PDBs and use those for the SA calculation of the free chains.

Wouldn't that just cause extra unnecessary I/O (or memory) usage? Can I ask you why this is not fine?

The extra memory usage is quite cheap.
The main issue is that your method cannot guarantee the chains are really separated from each other, and it will also cause trouble when applied to multiple-chain complexes.

Makes sense. Will change it.
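A minimal sketch of the suggested split, assuming the structure is available as PDB-format text lines. It groups ATOM/HETATM records by the chain ID in column 22, so each chain can be treated as its own free structure without translating anything. The function name and the toy records are hypothetical and not taken from Edesolv.py.

```python
def split_pdb_by_chain(pdb_lines):
    """Group ATOM/HETATM records by chain ID (column 22 of a PDB line)."""
    chains = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.setdefault(line[21], []).append(line)
    return chains


# Toy fixed-column records: the chain ID lands at string index 21.
template = "ATOM  {serial:>5d}  CA  ALA {chain}{resseq:>4d}"
pdb_lines = [
    template.format(serial=1, chain="A", resseq=1),
    template.format(serial=2, chain="B", resseq=1),
    template.format(serial=3, chain="A", resseq=2),
    "END",
]

chains = split_pdb_by_chain(pdb_lines)
for chain_id, records in sorted(chains.items()):
    print(chain_id, len(records))
```

Because every chain ends up in its own list, the same function also covers complexes with more than two chains.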

CunliangGeng · 2021-08-03T15:23:36Z

deeprank/features/Edesolv.py

+        self.edesolv_data = {}
+        self.edesolv_data_xyz = {}
+
+        for key, coords in zip(keys, xyz):


The loop here uses the contact atoms to calculate Edesolv. The contact atoms depend on the distance cutoff, so Edesolv will depend on the distance cutoff too. Calculating Edesolv this way is not a great idea.
You could keep this code and write a second version that follows the HADDOCK approach, then compare the results to check the difference.

So would you want me to calculate the Edesolv for each atom in the proteins (as HADDOCK does), although in the end we only use the ones in the grid? Do you want it to be done constitutively or just for the comparison with HADDOCK?

Aha, I forgot we only need the atoms in the grid. It's fine then to calculate Edesolv for only the contact atoms, with one thing to improve:
Make sure the distance_cutoff for defining contacts is the same as the one used to get the grid atoms. You should pass the value in DataGenerator.create_database(contact_distance=8.5) to the "# get the contact atoms" step in this code.

To make sure the results are correct, you also need to compare them with HADDOCK by using a distance cutoff big enough to include all atoms.

Ok.
Thank you very much for the tips, will do.
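The dependence on the cutoff discussed above can be sketched as follows. This is an illustrative, pure-Python contact search with a hypothetical function name and toy coordinates; it is not the DeepRank implementation, and treating a distance equal to the cutoff as a contact is an assumption.

```python
import math


def contact_atom_indices(coords_a, coords_b, cutoff):
    """Return indices of atoms in each chain within `cutoff` of the other chain."""
    in_a, in_b = set(), set()
    for i, atom_a in enumerate(coords_a):
        for j, atom_b in enumerate(coords_b):
            if math.dist(atom_a, atom_b) <= cutoff:
                in_a.add(i)
                in_b.add(j)
    return sorted(in_a), sorted(in_b)


# Toy coordinates in angstroms (illustration only).
chain_a = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
chain_b = [(5.0, 0.0, 0.0), (40.0, 0.0, 0.0)]

# Same value as contact_distance=8.5 in the comment above.
interface = contact_atom_indices(chain_a, chain_b, cutoff=8.5)
# A very large cutoff selects every atom, as needed for the HADDOCK comparison.
everything = contact_atom_indices(chain_a, chain_b, cutoff=1000.0)
print(interface, everything)
```

Passing one shared cutoff value to both the grid step and the contact step keeps the two atom selections consistent.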


CunliangGeng

Nice work, thanks.
I summarise the major comments below:

1. Try to calculate Edesolv the way HADDOCK does and get rid of the dependence on contact atoms
2. Try to use the absolute difference rather than the Pearson correlation coefficient to compare the results between this code and HADDOCK
3. Write a unit test to test the code
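A short sketch of why the second comment above matters: two sets of values can correlate perfectly while still differing by a constant offset, which only the absolute difference reveals. The numbers below are made up for illustration and are not results from this PR; both helper functions are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mean_absolute_difference(x, y):
    """Average of |x_i - y_i| over paired values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Made-up energies: the second list is the first shifted by a constant.
reference = [-10.0, -5.0, 0.0, 5.0]
shifted = [value + 4.0 for value in reference]

r = pearson(reference, shifted)
mad = mean_absolute_difference(reference, shifted)
print(r, mad)
```

Here the correlation is perfect even though every value is off by 4 units, so a high Pearson coefficient alone does not show that two implementations agree.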

…d chain2 arguments to compute_feature


DarioMarzella · 2021-08-09T13:50:29Z

Write a unit test to test the code

For point 3, isn't it better if I simply add Edesolv as a feature in test_generate.py? In this way, it will be tested along with the other features and there will be no need to change its specific unit test.

CunliangGeng · 2021-08-09T17:57:37Z


Yep, makes sense :-)

…ne not containing the interacting chain

CunliangGeng · 2021-08-13T12:56:34Z

@DarioMarzella Please leave me a message when all requested changes have been handled :-)
The tests failed; please look into it.

DarioMarzella · 2021-08-13T13:05:21Z


Yes, I am waiting for the comparison with HADDOCK to update you :)

DarioMarzella · 2021-08-17T11:22:10Z


I just got the result, and it seems that the Edesolv I get from DeepRank with a grid big enough to include the whole protein correlates slightly worse with HADDOCK than the one I previously got using the interface only. This makes me wonder whether HADDOCK uses the whole proteins or only the surface atoms for Edesolv.
(Please note that I had to use a different set of structures from the one previously used for comparison: I previously used all the CAPRI set structures, and now I could use only some BM5 structures.)

Last result (using all-atom Edesolv):
Pearson correlation: 0.92
Mean absolute difference: 4.13
Plot:
haddock_vs_deeprank.pdf

Previous results (using interface-only Edesolv):
Pearson correlation: 0.98
Mean absolute difference: -2.93
Plot:
haddock_vs_deeprank_capri.pdf

DarioMarzella · 2021-08-17T11:41:49Z

Update: it might be that HADDOCK uses different thresholds to calculate the Edesolv.
My values correlate better with HADDOCK itw and worse with HADDOCK it1 and it0.

HADDOCK it0 models:
Pearson correlation: 0.89
Mean absolute difference: 5.05

HADDOCK it1 models:
Pearson correlation: 0.95
Mean absolute difference: 2.98

HADDOCK itw models:
Pearson correlation: 0.97
Mean absolute difference: 3.13

CunliangGeng · 2021-08-19T15:38:58Z

@DarioMarzella Thanks for the results! If we're using the same equations and parameters as HADDOCK, the results should be the same, or the difference should be at the level of numerical precision, e.g. 0.01(?). Did you check the HADDOCK code? If not, you'd better have a look, and also clear up your question about whether HADDOCK uses the whole proteins or only the surface atoms for Edesolv. (I don't remember and cannot provide a direct answer here :-( )

Also note that the difference might come from the calculation of the surface area: HADDOCK does not use Biopython for that. If the major part of the difference comes from the area calculation, that's fine, and we should leave a note in the documentation.
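One way to act on the precision criterion above is to list the models whose values differ from the reference by more than a tolerance. This is a minimal sketch: the function name, model names, and energies are hypothetical, and the default of 0.01 simply mirrors the tentative figure in the comment.

```python
def models_outside_tolerance(reference, computed, tolerance=0.01):
    """Map model name -> signed difference for models exceeding `tolerance`."""
    outliers = {}
    for name, ref_value in reference.items():
        difference = computed[name] - ref_value
        if abs(difference) > tolerance:
            outliers[name] = difference
    return outliers


# Hypothetical per-model energies (illustration only).
haddock_values = {"model_1": -12.50, "model_2": 3.20}
deeprank_values = {"model_1": -12.505, "model_2": 6.33}

outliers = models_outside_tolerance(haddock_values, deeprank_values)
print(sorted(outliers))
```

Inspecting the flagged models one by one can help separate differences caused by the surface-area calculation from differences in the energy terms themselves.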

CunliangGeng · 2023-11-24T16:00:29Z

This repo is going to be archived (#260), so I am closing this PR now.

DarioMarzella and others added 8 commits August 24, 2020 10:54

Modified .gitignore to ignore development folder ./Edesolv

4937f61

Added development direcotry ./Edesolv/

e76671a

Add first version of class_edesolv.py

9d2f91b

Merge remote-tracking branch 'origin/master' into issue_68_Edesolv

5ae370a

Add Edesolv.py to deeprank/features. Still have to test it in DataGenerator()

7c284df

Merge pull request #224 from DeepRank/master

59faeab

Pulling master into issue_68 for testing purposes

Updated Edesolv.py. First test with only CAPRI T35 gave positive results. Should test further.

8edff4d

Add docstrings to deeprank/features/Edesolv.py. Add Edesolv to docs/deeprank.features.rst

99ea3e5

DarioMarzella requested review from CunliangGeng and NicoRenaud July 15, 2021 18:32

Remove folder 'Edesolv' used for testing during the development

1d14acb