Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Try uppercase atom names when guessing the mass #2331

Merged
merged 10 commits into from
Aug 28, 2019

Conversation

RMeli
Copy link
Member

@RMeli RMeli commented Aug 20, 2019

Fixes #2265

This is an attempt to get rid of the UserWarning: Failed to guess the mass for the following atom types: Ca warning when PDB files contain metal atoms.

It is probably not the best approach, but it's the one that minimize side effects (such as changing tables.py). Please let me know if you want to solve issue #2265 in a different way.

Changes made in this Pull Request:

  • Try uppercase atom names when guessing the mass

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@codecov
Copy link

codecov bot commented Aug 20, 2019

Codecov Report

Merging #2331 into develop will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2331      +/-   ##
===========================================
+ Coverage    89.89%   89.89%   +<.01%     
===========================================
  Files          173      173              
  Lines        21778    21784       +6     
  Branches      2861     2861              
===========================================
+ Hits         19577    19583       +6     
  Misses        1609     1609              
  Partials       592      592
Impacted Files Coverage Δ
package/MDAnalysis/topology/guessers.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 57bc183...e9e1680. Read the comment docs.

@richardjgowers
Copy link
Member

I think there's a thing with carbon-alpha which will make this cause lots of tears down the line @orbeckst ?

@RMeli
Copy link
Member Author

RMeli commented Aug 21, 2019

Yes, I followed the discussion in Issue #1808. However CA has a mass of 40.08000 in TABLE_MASSES . Additionally, in TABLE_ATOMELEMENTS, which is described at the translation of atomnames to types/element, CAL, C0 and CA2+ are all mapped to CA.

I agree we probably need to be more careful and I'm happy to re-work this PR in order to solve the problem at a deeper level.

Maybe the element names in tables.py should only be the actual element names (first letter uppercase, second letter lowercase) so that there is no confusion? The problem of CA being alpha-carbon could then be solved by the parser if it needs to guess the element (because that information is omitted) or if the element name provided is not the standard name; for example CA as HETATM is Ca while CA in a chain is alpha-carbon, or something on these lines.

However, I'm by no means very informed on all the corner cases and therefore I'd be happy to discuss it further and possibly fix it the way you prefer.

@lilyminium
Copy link
Member

for example CA as HETATM is Ca while CA in a chain is alpha-carbon, or something on these lines.

This would rely on the file format being a PDB, which is just one of many formats MDAnalysis supports. Recently #2314 was merged, which I think would get rid of your UserWarning but not in the way you desire: the mass guessed would be carbon's.

Probably the best way to get around it would be to refine guessing the element further. The easiest way I can think of would be passing in the residue name and checking it against a list of non-biological-looking names, e.g. the element name itself, or ION, etc... I personally feel that checking against a list of biological-looking names could run into too many issues with modified amino acids.

Other ambiguous elements: Hg, He, Ne, Mn (virtual sites) ...

@RMeli
Copy link
Member Author

RMeli commented Aug 22, 2019

Thanks for pointing out your PR, @lilyminium. I think you PR was already in my bugfix/metals, but I'll double check on my laptop.

I believe it would be nice to have a standard and clean tables.py (where there is no ambiguity on the element name: C is any carbon, Ca is calcium, CA does not exist) and then refine the atom guess in the parsers for different file formats (for which the complexity and the details of this guess will be different). For example passing the residue name for PDB files sound a good idea to me (but I'm by no mean an expert)!

@lilyminium
Copy link
Member

My apologies, I misunderstood the issue. @richardjgowers by the point an atom's element is Ca (either through user/file specification or element guessing), it makes sense that the atom's mass should correspond. If it's causing carbon-alpha grief, it would be better to fix the element guessing rather than hack the mass lookup.

I agree with you @RMeli, the capitalisation of elements is a bit all over the place and results in unnecessary errors like #1808 and #2265 . I see I've contributed to this confusion by uppercasing tables.elements. Maybe everything should be in Capitalised style?

@IAlibay
Copy link
Member

IAlibay commented Aug 22, 2019

Just jumping in the discussion here.

I would agree with #2331 (comment) that a solution that involves residue names might be useful here. Most of the time resname == atomname for metals (if you strip out numbers/signs and match any case) in the PDB. A check for this would allow for Ca, Mn, etc... to be picked up as metals when they are intended to do so.

@richardjgowers
Copy link
Member

Yeah all the guessing functions can be written to use more than just one other array of information. So really PDB / whatever parsing could be written to parse all the info from a file then do guessing based on combinations of information.

Also here masses are guessed from an atomtype, when really the problem is guessing element from an atomtype. (Element to mass is trivial)

So maybe a better solution is for PDBParser to have a better element guesser (one that actually ensures that elements are returned too...), then mass->element should just be a trivial dict lookup?

@RMeli
Copy link
Member Author

RMeli commented Aug 22, 2019

So maybe a better solution is for PDBParser to have a better element guesser (one that actually ensures that elements are returned too...), then mass->element should just be a trivial dict lookup?

I agree, this is probably the best option.

I think the required steps are the following:

  • Duplicate all properties in tables.py where the element name is capitalised (for temporary backward compatibility) and associate them to the correct element name
  • Improve the atoms type guess in PDBReader
    • Try to use the element name PDB field, if present
    • Raise a warning and use other information if the element name PDB field is not present
  • Improve the atom type guess for other formats, if needed
  • Cleanup tables.py so that they only contain standard element names

@orbeckst
Copy link
Member

orbeckst commented Aug 22, 2019

I agree with the general discussion that parsers should be assigning elements to atoms because this the place to encode domain-specific knowledge.

There's the caveat that coarse grained systems cannot be assigned an element.

People will likely also want to have a way to disable any guessing, e.g., so that they can fix-up the topology by themselves.

Not having elements is not a big deal, but having no masses will lead to AtomGroup and friends not having a whole bunch of methods (see also #1845) . So we should have a decent way to warn users and give suggestions what to do. Something along the line of: "you can manually assign uniform mass = 1 to all particles with ... or better: use a topology format that contains masses explicitly."

@RMeli
Copy link
Member Author

RMeli commented Aug 23, 2019

Thanks for the comments @orbeckst. I agree that the guessing should be controllable by the user.

I could give it a try in the next few weeks.

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok this is a bandaid (as discussed above) but I think it helps a little

@richardjgowers richardjgowers merged commit e059c92 into MDAnalysis:develop Aug 28, 2019
@richardjgowers
Copy link
Member

thanks @RMeli !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Metal Atoms in PDB not properly recognized
5 participants