Allow for >20 types in AminoAcid class #198

Open
DTRademaker opened this issue Sep 26, 2022 · 11 comments
Labels
feature New feature to add nice to have low priority issues
Projects

Comments

@DTRademaker
Collaborator

DTRademaker commented Sep 26, 2022

Hello,

I am working with the AminoAcid class (deeprankcore/models/amino_acid.py) because the Residue class (deeprankcore/models/structure.py) requires it, and I am running into an issue. The 20 common amino-acids are nicely initialized in deeprankcore/domain/amino_acid.py, but I often also have non-standard or unknown amino-acids. It is quite easy to create a new amino-acid with new properties; however, the deeprankcore/models/amino_acid.py script is hardcoded to use a one-hot encoding of length 20, which leaves no room for other amino-acids. Can this one-hot length be changed somehow? Or at least be 21 long, so that we always have a spot for 'other'?

Daniel
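
To make the request concrete, here is a minimal sketch of a length-21 encoding with a catch-all slot. The names STANDARD_CODES, OTHER_INDEX and one_hot_with_other are made up for illustration and are not part of deeprankcore.

```python
# Hypothetical sketch: none of these names come from deeprankcore.
STANDARD_CODES = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]
OTHER_INDEX = len(STANDARD_CODES)  # slot 20, i.e. the 21st position


def one_hot_with_other(three_letter_code):
    """Return a length-21 one-hot list; unknown residues use the last slot."""
    encoding = [0.0] * (len(STANDARD_CODES) + 1)
    code = three_letter_code.upper()
    if code in STANDARD_CODES:
        index = STANDARD_CODES.index(code)
    else:
        index = OTHER_INDEX
    encoding[index] = 1.0
    return encoding
```

With this layout every non-standard residue shares one slot, so different non-standard types cannot be told apart from the encoding alone.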

@DTRademaker DTRademaker added the feature New feature to add label Sep 26, 2022
@DaniBodor
Collaborator

Hi Daniel,
I can look into this. Having a 21st for 'other' should definitely be doable, but I will check if it can be open ended.

Just out of curiosity, how is it possible that you have an amino-acid that is not part of the standard 20?

@DaniBodor DaniBodor added this to To do in UX and docs via automation Oct 13, 2022
@DaniBodor DaniBodor changed the title Fixed nmb of aminoacids Allow for >20 types in AminoAcid class Oct 13, 2022
@DaniBodor
Collaborator

Also, is there a limited list of additional amino acids that you get (or at least main ones)? If so, maybe it makes sense to add them to the default list in addition to allowing for extra user-defined ones.

@DaniBodor
Collaborator

A quick and dirty solution is to change a = numpy.zeros(20) to a = numpy.zeros(len(amino_acids)) in the one_hot definition of the AminoAcid class (in deeprankcore/models/amino_acid.py). This assumes that you have added your new amino acids to the list at the end of the module.
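
A rough sketch of that change, using a pared-down stand-in for the class. The constructor arguments, the onehot property name and the three example entries are assumptions for illustration; only the numpy.zeros(len(amino_acids)) line reflects the suggestion above.

```python
import numpy


class AminoAcid:
    """Hypothetical, pared-down stand-in for the class discussed above."""

    def __init__(self, name, three_letter_code, index):
        self.name = name
        self.three_letter_code = three_letter_code
        self.index = index  # position of this amino acid in the one-hot vector

    @property
    def onehot(self):
        # Size the vector from the list instead of hardcoding 20.
        a = numpy.zeros(len(amino_acids))
        a[self.index] = 1.0
        return a


# Illustrative entries only; the real module defines all standard amino acids.
alanine = AminoAcid("Alanine", "ALA", 0)
cysteine = AminoAcid("Cysteine", "CYS", 1)
selenocysteine = AminoAcid("Selenocysteine", "SEC", 2)  # user-added entry

# The list at the end of the module; its length now drives the vector size.
amino_acids = [alanine, cysteine, selenocysteine]
```

Note that each entry's index has to stay unique and below len(amino_acids), otherwise the assignment in onehot raises an IndexError.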

Going forward, we might want to make this more sustainable by creating a class that contains each potential amino acid (similar to what we do for the AtomicResidue class).

UX and docs automation moved this from To do to Done Oct 16, 2022
@DTRademaker
Collaborator Author

DTRademaker commented Oct 16, 2022

"Just out of curiosity, how is it possible that you have an amino-acid that is not part of the standard 20?"
Well, there are many more non-canonical amino-acids, for example Selenocysteine (also known as the "21st proteinogenic amino acid"). In some cases amino acids form covalent bonds with other molecules such as phosphates or sugar groups, and researchers might want to label them differently.
"A quick and dirty solution..."
Yes, this is possible, and normally I would also do it like this :), but it is not desirable in the case I am working on. I want to publish and thereby share code with people who have installed the 'standard' deeprankcore code.

After thinking about it, I think this function does not belong in the AminoAcid class at all, but should be incorporated in the features section, where researchers could add as many extra non-canonical amino-acids as they want.
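
One way this could look, as a sketch only: the encoding lives in a feature-level helper that takes the residue vocabulary as an argument, so the AminoAcid class never needs to know how many types exist. The function name make_residue_type_encoder and the shortened example vocabulary are hypothetical; SEC and SEP are the usual PDB codes for selenocysteine and phosphoserine.

```python
def make_residue_type_encoder(residue_codes):
    """Build a one-hot encoder over a caller-supplied list of residue codes.

    Hypothetical helper: the feature owns the vocabulary, so adding a
    non-canonical residue does not require editing the AminoAcid class.
    """
    index_by_code = {code.upper(): i for i, code in enumerate(residue_codes)}
    if len(index_by_code) != len(residue_codes):
        raise ValueError("residue codes must be unique")

    def encode(code):
        vector = [0.0] * len(index_by_code)
        # Raises KeyError for codes that were not registered.
        vector[index_by_code[code.upper()]] = 1.0
        return vector

    return encode


# Shortened vocabulary for illustration, extended with two extra residues.
encode = make_residue_type_encoder(["ALA", "CYS", "SER"] + ["SEC", "SEP"])
```

Raising on unregistered codes is a design choice in this sketch; a fallback 'other' slot would work just as well if silent handling is preferred.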

@DTRademaker DTRademaker reopened this Oct 16, 2022
UX and docs automation moved this from Done to In progress Oct 16, 2022
@DaniBodor
Collaborator

DaniBodor commented Oct 19, 2022

"In some cases amino acids form covalent bonds with other molecules such as phosphates or sugar groups and researchers might want to label them differently."

I would say modified amino acids are not different amino acids. I do agree it would be nice to have an entire new feature module for PTMs, I had thought about that already. Probably not a priority right now, but a nice addition for some point in the future.

"Well there are many more non-canonical amino-acids, for example Selenocysteine (also known as the "21st proteinogenic amino acid")."

As for the non-canonical amino acids, I am not aware of many. I think there is the one you mentioned and some lysine variant that only exists in bacteria. These are both quite rare (at least in humans, but I think across biology) and very similar to a canonical counterpart, so I doubt much information gets lost by labeling them as that counterpart. Very few projects consider more than 20 core amino acids, unless the non-canonical ones are specifically being studied.

If you feel it's important to have these as part of the main code before publication, feel free to add them to the Amino Acid module (now in deeprankcore/models/amino_acid.py) and create a pull request for this.

@DaniBodor DaniBodor moved this from In progress to To do in UX and docs Oct 19, 2022
@gcroci2 gcroci2 moved this from To do to No status in UX and docs Oct 27, 2022
@github-actions

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale issue not touched from too much time label Nov 21, 2022
@github-actions

This issue was closed because it has been inactive for 7 days since being marked as stale.

UX and docs automation moved this from No status to Done Nov 30, 2022
@DaniBodor
Collaborator

DaniBodor commented Dec 8, 2022

@DTRademaker, I just realized that the 2 main (only?) non-canonical amino acids (selenocysteine and pyrrolysine) were in fact already defined in the aminoacidlist.py module, just not included in the amino_acid list itself. I have now added them in PR #272.

I noticed that these are currently indexed (one-hot encoded) as their canonical counterparts. Would it be better to give them their own one-hot encoding or is it ok to keep them as is? (I guess ideal would be to make this an option, but not sure it's worth the effort to program it in).
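
As a sketch of what such an option might look like: a single flag decides whether Sec and Pyl reuse the index of their canonical counterpart or get their own slots. All names here are hypothetical, and the mapping SEC to CYS and PYL to LYS is my assumption about which counterparts are meant.

```python
# Hypothetical sketch: names and layout are not taken from deeprankcore.
CANONICAL = [
    "ALA", "ARG", "ASN", "ASP", "CYS", "GLN", "GLU", "GLY", "HIS", "ILE",
    "LEU", "LYS", "MET", "PHE", "PRO", "SER", "THR", "TRP", "TYR", "VAL",
]
NONCANONICAL = ["SEC", "PYL"]
# Assumed canonical counterparts for the two non-canonical residues.
COUNTERPART = {"SEC": "CYS", "PYL": "LYS"}


def residue_index(code, separate_noncanonical=False):
    """Return the one-hot position for a three-letter residue code."""
    code = code.upper()
    if code in COUNTERPART:
        if separate_noncanonical:
            # Own slot, appended after the 20 canonical positions.
            return len(CANONICAL) + NONCANONICAL.index(code)
        code = COUNTERPART[code]  # share the counterpart's slot
    return CANONICAL.index(code)


def one_hot(code, separate_noncanonical=False):
    """Return a one-hot list of length 20, or 22 with separate slots."""
    size = len(CANONICAL)
    if separate_noncanonical:
        size += len(NONCANONICAL)
    vector = [0.0] * size
    vector[residue_index(code, separate_noncanonical)] = 1.0
    return vector
```

The default keeps the current behaviour (shared index), so existing feature vectors would keep their length unless the flag is switched on.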

@DaniBodor DaniBodor reopened this Dec 8, 2022
UX and docs automation moved this from Done to In progress Dec 8, 2022
@DaniBodor DaniBodor linked a pull request Dec 8, 2022 that will close this issue
@github-actions github-actions bot removed the stale issue not touched from too much time label Dec 9, 2022
@DaniBodor
Collaborator

OK, so including Sec and Pyl actually leads to problems during parsing. For now I will not spend time trying to resolve this. Maybe in the future we can look into it more closely if there is a direct application for it.

@DaniBodor DaniBodor removed this from In progress in UX and docs Dec 9, 2022
@DaniBodor DaniBodor added this to Nice to have in Development via automation Dec 9, 2022
@DaniBodor DaniBodor removed a link to a pull request Dec 9, 2022
@DaniBodor DaniBodor added the nice to have low priority issues label Dec 9, 2022
@github-actions

github-actions bot commented Jan 9, 2023

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale issue not touched from too much time label Jan 9, 2023
@DaniBodor DaniBodor removed the stale issue not touched from too much time label Jan 10, 2023