Modelangelo v1.0.1 failing if the fasta contains unknown residues "X" #53

Open
sroet opened this issue Jun 15, 2023 · 4 comments

sroet commented Jun 15, 2023

Hey,

I have a FASTA file in which certain residues are unknown and are therefore represented with X,
such as this one (one at the start and one at position 105):

>chain 'CM'
XPFKRFVEIGRVALVNYGKDYGRLVVIVDVVDQNRALVDAPDMVRCQINFKRLSLTDIKIDIKRVPKKTTLIKAMEEADVKNKWENSSWGKKLIVQKRRASLNDXDRFKVMLAKIKRGGAIRQELAKLKKTAAA

When trying to build against these fasta sequences you get an internal assertion error:

Log file:
2023-06-15 at 17:54:09 | INFO | ModelAngelo with args: {'volume_path': '../sharpened.mrc', 'protein_fasta': '../fasta_files/proteins.fa', 'rna_fasta': '../fasta_files/rna.fa', 'dna_fasta': None, 'output_dir': '20230615_fasta', 'mask_path': None, 'device': '0', 'config_path': None, 'model_bundle_name': 'nucleotides', 'model_bundle_path': None, 'keep_intermediate_results': False, 'pipeline_control': False, 'func': <function main at 0x7f302c6b9430>}
2023-06-15 at 17:54:09 | INFO | Initial C-alpha prediction with args: {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'box_size': 64, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': True, 'save_backbone_trace': False, 'save_output_grid': False, 'crop': 6, 'log_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha', 'map_path': '../sharpened.mrc', 'output_path': '20230615_fasta/see_alpha_output', 'mask_path': None, 'device': '0', 'auto_mask': False}
2023-06-15 at 17:54:09 | INFO | Using model file /data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha/model.py
2023-06-15 at 17:54:09 | INFO | Using checkpoint file /data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha/chkpt.torch
2023-06-15 at 17:54:10 | INFO | Input structure has shape: (162, 162, 162)
2023-06-15 at 17:54:10 | INFO | Running with these arguments:
2023-06-15 at 17:54:10 | INFO | {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'box_size': 64, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': True, 'save_backbone_trace': False, 'save_output_grid': False, 'crop': 6, 'log_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/c_alpha', 'map_path': '../sharpened.mrc', 'output_path': '20230615_fasta/see_alpha_output', 'mask_path': None, 'device': '0', 'auto_mask': False}
2023-06-15 at 18:01:55 | INFO | Model prediction done, took 465.11 seconds for 343 sliding windows
2023-06-15 at 18:01:55 | INFO | Average time is 1356.012 ms
2023-06-15 at 18:01:55 | INFO | Starting Cα grid to points...
2023-06-15 at 18:01:56 | INFO | Have 17015 Cα points before pruning and 7629 after pruning
2023-06-15 at 18:01:57 | INFO | Starting P grid to points...
2023-06-15 at 18:01:58 | INFO | Have 10785 P points before pruning and 4260 after pruning
2023-06-15 at 18:01:59 | INFO | Finished inference!
2023-06-15 at 18:01:59 | INFO | GNN model refinement round 1 with args: {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 1, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': True, 'seq_attention_batch_size': 200, 'fp16': False, 'batch_size': 1, 'voxel_size': 1.0, 'map': '../sharpened.mrc', 'protein_fasta': '../fasta_files/proteins.fa', 'rna_fasta': '../fasta_files/rna.fa', 'dna_fasta': None, 'struct': '20230615_fasta/see_alpha_output/see_alpha_merged_output.cif', 'output_dir': '20230615_fasta/gnn_output_round_1', 'model_dir': '/data/public/model_angelo_weights/hub/checkpoints/model_angelo_v1.0/nucleotides/gnn', 'device': '0', 'write_hmm_profiles': False, 'refine': False}
2023-06-15 at 18:01:59 | INFO | Loaded module from step: 483863
2023-06-15 at 18:02:49 | ERROR | Error in ModelAngelo
Traceback (most recent call last):

  File "/opt/apps/miniconda3/envs/model_angelo/bin/model_angelo", line 33, in <module>
    sys.exit(load_entry_point('model-angelo==1.0.1', 'console_scripts', 'model_angelo')())
    │   │    └ <function importlib_load_entry_point at 0x7f30f12630d0>
    │   └ <built-in function exit>
    └ <module 'sys' (built-in)>
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/__main__.py", line 52, in main
    args.func(args)
    │    │    └ Namespace(volume_path='../sharpened.mrc', protein_fasta='../fasta_files/proteins.fa', rna_fasta='../fasta_files/rna.fa', dna_...
    │    └ <function main at 0x7f302c6b9430>
    └ Namespace(volume_path='../sharpened.mrc', protein_fasta='../fasta_files/proteins.fa', rna_fasta='../fasta_files/rna.fa', dna_...
> File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/apps/build.py", line 241, in main
    gnn_output = gnn_infer(gnn_infer_args)
                 │         └ {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 1, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': Tru...
                 └ <function infer at 0x7f302d433d30>
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/gnn/inference.py", line 92, in infer
    protein = get_lm_embeddings_for_protein(lang_model, batch_converter, protein)
              │                             │           │                └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
              │                             │           └ <esm.data.BatchConverter object at 0x7f301e126fd0>
              │                             └ ProteinBertModel(
              │                                 (embed_tokens): Embedding(33, 1280, padding_idx=1)
              │                                 (layers): ModuleList(
              │                                   (0): TransformerLayer(
              │                                  ...
              └ <function get_lm_embeddings_for_protein at 0x7f302d433e50>
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/data/generate_complete_prot_files.py", line 34, in get_lm_embeddings_for_protein
    protein_with_lm = add_lm_embeddings_to_protein(protein, lm_embeddings)
                      │                            │        └ array([[ 8.0417655e-04,  3.0484083e-01,  6.0511094e-01, ...,
                      │                            │                  -2.1142796e-01, -2.8297421e-01, -9.1318183e-02],
                      │                            │                 ...
                      │                            └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
                      └ <function add_lm_embeddings_to_protein at 0x7f302d43d670>
  File "/opt/apps/miniconda3/envs/model_angelo/lib/python3.9/site-packages/model_angelo-1.0.1-py3.9.egg/model_angelo/utils/protein.py", line 897, in add_lm_embeddings_to_protein
    assert len(lm_embeddings) == input_protein.unified_seq_len
               │                 │             └ 5488
               │                 └ Protein(atom_positions=None, atomc_positions=None, aatype=None, atom_mask=None, atomc_mask=None, residue_index=None, chain_in...
               └ array([[ 8.0417655e-04,  3.0484083e-01,  6.0511094e-01, ...,
                         -2.1142796e-01, -2.8297421e-01, -9.1318183e-02],
                        ...

AssertionError: assert len(lm_embeddings) == input_protein.unified_seq_len

If I simply remove the "X" residues from the FASTA sequence, it at least builds a model without issue (I still have to check whether the result is reasonable for my complete complex).

What would be the best way of dealing with these unknown residues: delete them, replace them with glycines, or something else?
Also, it would probably be nice to catch this issue before the start of the C-alpha prediction.
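
A pre-check of this kind could be sketched roughly as follows. This is only an illustration and not ModelAngelo code: the helper names `parse_fasta` and `find_nonstandard_residues` are hypothetical, and it assumes a protein FASTA in which anything outside the 20 standard one-letter codes should be flagged before the run starts.

```python
# Hypothetical pre-check, not part of ModelAngelo.
# Assumption: only the 20 standard amino-acid letters are acceptable.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_nonstandard_residues(text):
    """Return (header, 1-based position, residue) for each non-standard letter."""
    problems = []
    for header, seq in parse_fasta(text):
        for pos, residue in enumerate(seq.upper(), start=1):
            if residue not in STANDARD_AA:
                problems.append((header, pos, residue))
    return problems


example = """>chain 'CM'
XPFKRFVEIGRVALVNYGKDYGRLVVIVDVVDQNRALVDAPDMVRCQINFKRLSLTDIKIDIKRVPKKTTLIKAMEEADVKNKWENSSWGKKLIVQKRRASLNDXDRFKVMLAKIKRGGAIRQELAKLKKTAAA
"""
problems = find_nonstandard_residues(example)
for header, pos, residue in problems:
    print(f"{header}: unexpected residue {residue!r} at position {pos}")
```

Running a check like this on the FASTA files before launching the build would report both X positions in seconds rather than after the C-alpha prediction has finished.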

sroet changed the title from 'Modelanglo v1.0.1 failing if the fasta contains unknown residues "X"' to 'Modelangelo v1.0.1 failing if the fasta contains unknown residues "X"' on Jun 15, 2023
jamaliki pinned this issue on Jun 15, 2023
@jamaliki
Collaborator

Hi Sander,

You are right, this should be handled better by the program. For now, you could put a glycine in place of each X so that the numbering of the final model is what you expect. Hopefully I will push out a fix soon.

Best,
Kiarash.
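
The glycine workaround above could be applied with a small script along these lines. It is a minimal sketch, not something shipped with ModelAngelo: the function name `replace_unknown_residues` and its parameters are made up for illustration. It swaps each X for G one-to-one, so sequence length and residue numbering stay the same, and it leaves header lines untouched.

```python
# Hypothetical helper, not part of ModelAngelo.
def replace_unknown_residues(fasta_text, unknown="X", placeholder="G"):
    """Replace unknown residues in sequence lines, leaving '>' headers as they are."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Header lines may legitimately contain the letter X; keep them verbatim.
            out.append(line)
        else:
            # One-for-one substitution keeps the sequence length unchanged.
            out.append(line.replace(unknown, placeholder))
    return "\n".join(out) + "\n"


original = ">chain 'CX'\nXPFKRFVEX\n"
patched = replace_unknown_residues(original)
print(patched)
```

Because the placeholder glycines are not real residues, it is probably worth noting their positions so they can be reviewed in the final model.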

@ColdPopeye

> What would be the best way of dealing with these unknown residues, just delete them, replace them with glycines, or something else? Also, it would probably be nice to catch this issue before the start of the C-alpha prediction

I think this is an underrated issue. It doesn't make sense that the input files aren't checked before the prediction starts; sometimes weird formatting or an error in a sequence can take 20 minutes to show up.

@jamaliki
Collaborator

@ColdPopeye this should have been fixed as of v1.0.8. Does it still give you an issue?

@ColdPopeye

> @ColdPopeye this should be fixed since v1.0.8, does it still give you an issue?

I have a slightly sillier problem, and I'm not sure where or when it comes up. I have sequencing results in a Word file, from which I copy and paste the sequences to make a FASTA file. Sometimes formatting gets copied along, or I mis-paste something (I think because Windows and WSL behave strangely). In any case, the error only shows up after the C-alpha prediction, which is a bit annoying.
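
For text pasted from Word, a clean-up pass along the following lines might help before running the build. This is a rough sketch under a few assumptions: the stray characters are things like Windows line endings, a byte-order mark, zero-width characters, or non-breaking spaces, and `clean_pasted_fasta` is a hypothetical name rather than anything ModelAngelo provides.

```python
import unicodedata


# Hypothetical clean-up helper, not part of ModelAngelo.
def clean_pasted_fasta(text):
    """Normalize line endings and strip invisible characters from pasted FASTA text."""
    cleaned = []
    # Treat Windows (\r\n) and old Mac (\r) line endings like Unix ones.
    for line in text.replace("\r\n", "\n").replace("\r", "\n").split("\n"):
        # Drop control and format characters (BOM, zero-width spaces, tabs).
        line = "".join(
            ch for ch in line if unicodedata.category(ch) not in ("Cc", "Cf")
        )
        if line.lstrip().startswith(">"):
            cleaned.append(line.strip())
        else:
            # Remove every kind of whitespace, including non-breaking spaces.
            seq = "".join(ch for ch in line if not ch.isspace()).upper()
            if seq:
                cleaned.append(seq)
    return "\n".join(cleaned) + "\n"


raw = "\ufeff>chain A\r\nPFK RFV\u00a0EIG\r\n\u200brval\r\n"
cleaned = clean_pasted_fasta(raw)
print(cleaned)
```

This only removes invisible or whitespace debris; letters that are not valid residues would still need a separate check.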
