# Protein - Structure Mapping, Alignments, and Visualization
This notebook shows how to **map a single protein sequence to its structure**, align mutated sequences against it, and visualize the mutations on the structure.
<p>
<div class="alert alert-info">
**Input:** Protein ID + amino acid sequence + mutated sequence(s)
</div>
<div class="alert alert-info">
**Output:** Representative protein structure, sequence alignments, and visualization of mutations
</div>
</p>

## Imports

In [1]:
import sys
import logging

In [2]:
# Import the Protein class
from ssbio.core.protein import Protein

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level with `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. `DEBUG` is the most verbose.

- `CRITICAL`
     - Only really important messages shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Very detailed information; prints a large volume of messages
     
<p><div class="alert alert-warning">**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!</div></p>
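
To see how the level acts as a filter, here is a minimal sketch that uses only the standard `logging` module. The logger name `demo` and the in-memory stream are illustrative choices and are not part of `ssbio`.

```python
import io
import logging

# Hypothetical logger name, used only for this illustration
demo_logger = logging.getLogger("demo")
demo_logger.propagate = False  # keep the demo messages out of the root logger

# Collect messages in memory so they can be inspected afterwards
stream = io.StringIO()
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
demo_logger.handlers = [demo_handler]

demo_logger.setLevel(logging.INFO)
demo_logger.debug("very detailed message")  # below INFO, so it is dropped
demo_logger.info("structures mapped")       # kept
demo_logger.warning("something to check")   # kept

captured = stream.getvalue().splitlines()
print(captured)
```

Only messages at or above the configured level reach the handler, which is why `DEBUG` produces the most output.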

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [87]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]

## Initialization of the project

Set these three things:

- `ROOT_DIR`
    - The directory where a folder named after your `PROTEIN_ID` will be created
- `PROTEIN_ID`
    - Your protein ID
- `PROTEIN_SEQ`
    - Your protein sequence
    
A directory will be created in `ROOT_DIR` with your `PROTEIN_ID` name. The folders are organized like so:
```
    ROOT_DIR
    └── PROTEIN_ID
        ├── sequences  # Protein sequence files, alignments, etc.
        └── structures  # Protein structure files, calculations, etc.

```
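
The layout above can be reproduced with the standard library alone. This is a sketch of the folder structure only; `make_protein_dirs` is a hypothetical helper, not an `ssbio` function, and `Protein` creates these folders for you.

```python
import os
import tempfile


def make_protein_dirs(root_dir, protein_id):
    """Create root_dir/protein_id/{sequences,structures}; return the protein directory."""
    protein_dir = os.path.join(root_dir, protein_id)
    for sub in ("sequences", "structures"):
        os.makedirs(os.path.join(protein_dir, sub), exist_ok=True)
    return protein_dir


# Use a throwaway directory so the demo does not touch a real project
demo_root = tempfile.mkdtemp()
demo_dir = make_protein_dirs(demo_root, "MY_PROTEIN")
print(sorted(os.listdir(demo_dir)))
```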

In [88]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROTEIN_ID = 'SRR1753782_00918'
PROTEIN_SEQ = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

In [89]:
# Create the Protein object
my_protein = Protein(ident=PROTEIN_ID, root_dir=ROOT_DIR)

In [90]:
# Load the protein sequence
my_protein.load_manual_sequence(seq=PROTEIN_SEQ, ident='WT', write_fasta_file=True, set_as_representative=True)

DEBUG - [2017-03-06 16:30] [ssbio.core.protein] "WT: set as representative sequence"


<SeqProp WT at 0x7faabd47aa58>

## Mapping sequence -> structure

Since the sequence has been provided, we only need to BLAST it against the PDB.

<p><div class="alert alert-info">**Note:** These methods do not download any 3D structure files.</div></p>

### Methods

#### `blast_representative_sequence_to_pdb`
This will BLAST the representative sequence against the entire PDB, and return significant hits. XML files of the BLAST results are saved in the respective sequence folders for a protein.

<p><div class="alert alert-info">**Note:** A PDB BLAST may return hits from other organisms.</div></p>

- `seq_ident_cutoff`
    - Default: `0`
    - From 0 to 1
- `evalue`
    - Default: `0.0001`
    - Significance of BLAST results
- `all_genes`
    - Default: `False`
    - Set to `True` if you want all genes and their sequences BLASTed
    - Set to `False` if you only want to BLAST sequences that did not have any PDBs mapped to them already
- `display_link`
    - Default: `False`
    - Set to `True` if you want a clickable HTML link to be printed
- `force_rerun`
    - Default: `False`
    - Set to `True` if you want to ignore any existing XML results and run the BLAST again
- What's saved?
    - Protein structures
    ```python
    my_protein.structures
    ```
    - DataFrames
    ```python
    my_protein.df_pdb_blast
    ```
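
As a rough illustration of what `seq_ident_cutoff` and `evalue` do, the sketch below filters a list of hits stored as plain dictionaries. The hit records and the `filter_hits` helper are hypothetical, and the comparison operators (`>=`, `<=`) are an assumption; `ssbio` may apply the thresholds differently.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits at or above the identity cutoff and at or below the e-value."""
    return [
        hit for hit in hits
        if hit["hit_percent_ident"] >= seq_ident_cutoff
        and hit["hit_evalue"] <= evalue
    ]


# Hypothetical hits, made up for illustration only
example_hits = [
    {"pdb_id": "hit1", "hit_percent_ident": 0.96, "hit_evalue": 0.0},
    {"pdb_id": "hit2", "hit_percent_ident": 0.55, "hit_evalue": 1e-30},
    {"pdb_id": "hit3", "hit_percent_ident": 0.95, "hit_evalue": 0.01},
]

kept = filter_hits(example_hits, seq_ident_cutoff=0.9, evalue=0.00001)
print([hit["pdb_id"] for hit in kept])
```

Raising `seq_ident_cutoff` trades coverage for confidence: fewer structures pass, but those that do are closer to the query sequence.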

In [33]:
# Mapping using BLAST
my_protein.blast_representative_sequence_to_pdb(seq_ident_cutoff=0.9, evalue=0.00001)
my_protein.df_pdb_blast.head()

2017-03-06 16:25:56,469 ssbio.databases.pdb DEBUG    /tmp/SRR1753782_00918/sequences/WT_blast_pdb.xml: Loaded existing BLAST XML results
2017-03-06 16:25:56,469 ssbio.databases.pdb DEBUG    2W90: does not meet sequence identity cutoff
2017-03-06 16:25:56,470 ssbio.databases.pdb DEBUG    2W8Z: does not meet sequence identity cutoff
2017-03-06 16:25:56,470 ssbio.databases.pdb DEBUG    2IZ1: does not meet sequence identity cutoff
2017-03-06 16:25:56,471 ssbio.databases.pdb DEBUG    2IZ0: does not meet sequence identity cutoff
2017-03-06 16:25:56,471 ssbio.databases.pdb DEBUG    2IYO: does not meet sequence identity cutoff
2017-03-06 16:25:56,472 ssbio.databases.pdb DEBUG    2IYP: does not meet sequence identity cutoff
2017-03-06 16:25:56,472 ssbio.databases.pdb DEBUG    2PGD: does not meet sequence identity cutoff
2017-03-06 16:25:56,473 ssbio.databases.pdb DEBUG    1PGQ: does not meet sequence identity cutoff
2017-03-06 16:25:56,473 ssbio.databases.pdb DEBUG    1PGP: does not meet sequen

['2zyd', '2zya', '3fwn', '2zyg']

| pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|
| 2zyd | A | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 2zyd | B | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 2zya | A | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 2zya | B | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 3fwn | A | 2312.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |


## Downloading and ranking structures

### Methods

#### `pdb_downloader_and_metadata`
Download **all** structures per protein. This also adds metadata to each PDB object in the list of structures.

<p><div class="alert alert-warning">**Warning:** Don't run this if you don't need all PDB structures - just set representative structures below if you want 1 structure per protein.</div></p>

- `outdir`
    - Default: `None`
    - Set this to a custom location if you want to save PDB files outside the Protein folder
- `pdb_file_type`
    - Default: `'cif'` (set in Protein initialization, but can be changed here)
    - `'pdb'`, `'pdb.gz'`, `'mmcif'`, `'cif'`, `'cif.gz'`, `'xml.gz'`, `'mmtf'`, `'mmtf.gz'` - File type for files downloaded from the PDB.
- `force_rerun`
    - Default: `False`
    - Set to `True` if you want to re-download PDB files.
- What's saved?
    - Additional metadata per structure
    ```python
    my_protein.structures
    ```
    - DataFrames
    ```python
    my_protein.df_pdb_metadata
    ```

In [8]:
# Download all mapped PDBs and gather the metadata
my_protein.pdb_downloader_and_metadata()
my_protein.df_pdb_metadata.head(2)

['2zyd', '2zya', '3fwn', '2zyg']

| pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | date | taxonomy_name | structure_file |
|---|---|---|---|---|---|---|---|---|---|
| 2zyd | Dimeric 6-phosphogluconate dehydrogenase compl... | 6-phosphogluconate dehydrogenase, decarboxylat... | X-RAY DIFFRACTION | A;B | 1.5 | GLO | 2009-09-01;2011-07-13;2014-01-22 | Escherichia coli | 2zyd.cif |
| 2zya | Dimeric 6-phosphogluconate dehydrogenase compl... | 6-phosphogluconate dehydrogenase, decarboxylat... | X-RAY DIFFRACTION | A;B | 1.6 | 6PG | 2009-09-01;2011-07-13;2014-01-22 | Escherichia coli | 2zya.cif |


#### `set_representative_structure`
Rank available structures, run QC/QA, download and clean the final structure.

<p><div class="alert alert-info">**Note:** PDBs don't need to be downloaded before running this step. This is useful to limit the number of structures downloaded from the PDB.</div></p>

- `pdb_file_type`
    - Default: `'cif'` (set in Protein initialization, but can be changed here)
    - `'pdb'`, `'pdb.gz'`, `'mmcif'`, `'cif'`, `'cif.gz'`, `'xml.gz'`, `'mmtf'`, `'mmtf.gz'` - File type for files downloaded from the PDB.
- `engine`
    - Default: `'needle'`
    - Set to `'biopython'` if you want to utilize Biopython's built-in pairwise alignment algorithm.
- `always_use_homology`
    - Default: `False`
    - Set to `True` if you always want to use homology models.
- `seq_ident_cutoff`
    - Default: `0.5`
    - QC/QA: sets the minimum sequence identity a structure has to have to be selected as representative.
- `allow_missing_on_termini`
    - Default: `0.2`
    - QC/QA: Fraction (0 to 1) of the total reference sequence length, split evenly between the two termini, that is ignored when checking for modifications (mutations, deletions, insertions, or unresolved residues). Example: if `0.1` and the reference sequence is 100 AA, only residues 5 to 95 are checked for modifications.
- `allow_mutants`
    - Default: `True`
    - QC/QA: set to `True` if point mutations within the structure should be allowed.
- `allow_deletions`
    - Default: `False`
    - QC/QA: set to `True` if deletions within the structure should be allowed.
- `allow_insertions`
    - Default: `False`
    - QC/QA: set to `True` if insertions within the structure should be allowed.
- `allow_unresolved`
    - Default: `True`
    - QC/QA: set to `True` if unresolved regions within the structure should be allowed.
- `force_rerun`
    - Default: `False`
    - Set to `True` if pairwise alignments and structure cleaning should be rerun even if files exist.
- What's saved?
    - Representative protein structures
    ```python
    my_protein.representative_structure
    ```
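
The `allow_missing_on_termini` example above (a fraction of `0.1` on a 100 AA sequence leaves residues 5 to 95) can be worked through in a few lines. This is a sketch of the arithmetic only: `checked_window` is a hypothetical helper, and the even split between termini and the rounding down are assumptions inferred from that example.

```python
def checked_window(seq_length, allow_missing_on_termini):
    """Return (first, last) residue positions still checked for modifications."""
    # Half of the ignored fraction is taken from each terminus (assumption)
    tail = int(seq_length * allow_missing_on_termini / 2)
    return tail, seq_length - tail


print(checked_window(100, 0.1))
print(checked_window(468, 0.2))
```

With the default of `0.2`, a 468-residue sequence such as the one in this notebook would have roughly 46 residues ignored at each end under these assumptions.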

In [9]:
# Set representative structures
my_protein.set_representative_structure()

<StructProp 2zyd-A at 0x7f0581a5c828>

## Loading and aligning new sequences

You can load additional sequences into this protein object and align them to the representative sequence.

### Methods

#### `load_manual_sequence`
Loads in a new sequence string (same as what we used above)

In [10]:
# Input your mutated sequence and load it
mutated_protein1_id = 'N17P_SNP'
mutated_protein1_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein1_id, seq=mutated_protein1_seq)

<SeqProp N17P_SNP at 0x7fe5246abef0>

In [11]:
# Input another mutated sequence and load it
mutated_protein2_id = 'Q4S_N17P_SNP'
mutated_protein2_seq = 'MSKSQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

my_protein.load_manual_sequence(ident=mutated_protein2_id, seq=mutated_protein2_seq)

<SeqProp Q4S_N17P_SNP at 0x7fe5244fb550>

#### `pairwise_align_sequences_to_representative`
Conduct **pairwise alignments** of all sequences in the `sequences` attribute to the representative sequence.

- `gapopen`
    - Default: `10`
    - Gap open penalty
- `gapextend`
    - Default: `0.5`
    - Gap extension penalty
- `outdir`
    - Default: `None`
    - Set this to a custom location if you want to save alignment files outside the Protein folder
- `engine`
    - Default: `'needle'`
    - Set to `'biopython'` if you want to utilize Biopython's built-in pairwise alignment algorithm.
- `parse`
    - Default: `True`
    - Parse the results of the pairwise alignments, and store the information as Biopython SeqRecord annotations
- `force_rerun`
    - Default: `False`
    - Set to `True` if pairwise alignments should be rerun even if files exist.
- What's saved?
    - Sequence alignments (as `Bio.Align.MultipleSeqAlignment` objects)
    ```python
    my_protein.representative_sequence.sequence_alignments
    ```
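
Once two sequences are aligned, point mutations can be read off by walking the alignment column by column. The sketch below shows the idea on plain strings; `find_point_mutations` is a hypothetical helper, not the parser `ssbio` uses, and it reports positions in reference-sequence numbering.

```python
def find_point_mutations(aligned_ref, aligned_other, gap="-"):
    """List (ref_aa, ref_position, other_aa) for mismatched, ungapped columns."""
    if len(aligned_ref) != len(aligned_other):
        raise ValueError("aligned sequences must have the same length")

    mutations = []
    ref_pos = 0  # 1-based position in the ungapped reference sequence
    for ref_aa, other_aa in zip(aligned_ref, aligned_other):
        if ref_aa != gap:
            ref_pos += 1
        if gap in (ref_aa, other_aa):
            continue  # insertions and deletions are not point mutations
        if ref_aa != other_aa:
            mutations.append((ref_aa, ref_pos, other_aa))
    return mutations


# First 23 residues of the WT and N17P_SNP sequences used in this notebook
wt_prefix = "MSKQQIGVVGMAVMGRNLALNIE"
snp_prefix = "MSKQQIGVVGMAVMGRPLALNIE"
print(find_point_mutations(wt_prefix, snp_prefix))
```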

In [12]:
# Conduct pairwise sequence alignments
my_protein.pairwise_align_sequences_to_representative()

In [13]:
# View the stored information for one of the alignments
my_protein.representative_sequence.sequence_alignments
my_protein.representative_sequence.sequence_alignments[0].annotations

str(my_protein.representative_sequence.sequence_alignments[0][0].seq)
str(my_protein.representative_sequence.sequence_alignments[0][1].seq)

[<<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7fe5244fb6a0>,
 <<class 'Bio.Align.MultipleSeqAlignment'> instance (2 records of length 468, SingleLetterAlphabet()) at 7fe5244fb9e8>]

{'a_seq': 'WT',
 'b_seq': 'N17P_SNP',
 'deletions': [],
 'insertions': [],
 'mutations': [('N', 17, 'P')],
 'percent_gaps': 0.0,
 'percent_identity': 99.8,
 'percent_similarity': 99.8,
 'score': 2381.0}

'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

#### `representative_sequence.sequence_mutation_summary`
Summarize all mutations found. Returns 2 dictionaries, `single_counter` and `fingerprint_counter`.

- `single_counter` is a dictionary of point mutations and a list of sequences the point mutation shows up in.
    - Example:
    ```python
    {('A', 24, 'V'): ['Seq1', 'Seq2', 'Seq4'], 
    ('R', 33, 'T'): ['Seq2']}
    ```
- `fingerprint_counter` is a dictionary of groups of point mutations and a list of sequences the groups show up in.
    - This reports which sequences have the specific combinations (or "fingerprints") of point mutations
    - Example: 
    ```python
    {(('A', 24, 'V'), ('R', 33, 'T')): ['Seq2'], 
    (('A', 24, 'V'),): ['Seq1', 'Seq4']}
    ```
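
The two dictionaries can be built from a per-sequence list of mutations as sketched below. `summarize_mutations` is a hypothetical stand-in written for illustration; it mirrors the shape of the output described above but is not the `ssbio` implementation.

```python
from collections import defaultdict


def summarize_mutations(mutations_by_seq):
    """Build single-mutation and fingerprint counters from {seq_id: [mutations]}."""
    single_counter = defaultdict(list)
    fingerprint_counter = defaultdict(list)
    for seq_id, mutations in mutations_by_seq.items():
        if not mutations:
            continue  # sequences identical to the reference have no fingerprint
        for mutation in mutations:
            single_counter[mutation].append(seq_id)
        # The full, ordered set of mutations in one sequence is its fingerprint
        fingerprint_counter[tuple(mutations)].append(seq_id)
    return dict(single_counter), dict(fingerprint_counter)


example = {
    "N17P_SNP": [("N", 17, "P")],
    "Q4S_N17P_SNP": [("Q", 4, "S"), ("N", 17, "P")],
}
singles, fingerprints = summarize_mutations(example)
print(singles)
print(fingerprints)
```

Note that a fingerprint with a single mutation is a one-element tuple of tuples, written `(('N', 17, 'P'),)` with a trailing comma.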

In [14]:
# Summarize all the mutations in all alignments
s, f = my_protein.representative_sequence.sequence_mutation_summary()
print('Single mutations:')
s
print('---------------------')
print('Mutation fingerprints')
f

Single mutations:


{('N', 17, 'P'): ['N17P_SNP', 'Q4S_N17P_SNP'], ('Q', 4, 'S'): ['Q4S_N17P_SNP']}

---------------------
Mutation fingerprints


{(('N', 17, 'P'),): ['N17P_SNP'],
 (('Q', 4, 'S'), ('N', 17, 'P')): ['Q4S_N17P_SNP']}

## Viewing structures

The [nglview](https://github.com/arose/nglview) package is used as the backend for viewing structures within a Jupyter notebook. Many more options are available if you call it directly:

```python
import nglview
view = nglview.show_structure_file(my_protein.representative_structure.structure_path)
view
```

`ssbio` provides some wrapper functions to easily view structures and also map sequence residue numbers to structure residue numbers:

### Methods

#### `representative_structure.view_structure`
View the protein structure.

- `opacity`
    - Default: `1`
    - Opacity of the structure
- `gui`
    - Default: `False`
    - If the NGLview GUI should show up

In [15]:
# View just the structure
my_protein.representative_structure.view_structure()

#### `view_all_mutations`
Map all sequence alignment mutations to the structure and scale the visualization based on frequency of mutation.

- `grouped`
    - Default: `False`
    - If groups of mutations should be colored and sized together
- `color`
    - Default: `'red'`
    - Color of the mutations (overridden if `unique_colors=True`)
- `unique_colors`
    - Default: `True`
    - If each mutation/mutation group should be colored uniquely
- `structure_opacity`
    - Default: `0.5`
    - Opacity of the protein structure cartoon representation
- `opacity_range`
    - Default: `(0.8, 1)`
    - Min/max opacity values (mutations that show up more will be opaque)
- `scale_range`
    - Default: `(1, 5)`
    - Min/max size values (mutations that show up more will be bigger)
- `gui`
    - Default: `False`
    - If the NGLview GUI should show up
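
One way to picture `scale_range` and `opacity_range` is as a linear rescaling of mutation counts onto a min/max interval. The sketch below assumes linear interpolation; `scale_by_frequency` is a hypothetical helper, and the scaling `ssbio` actually applies may differ.

```python
def scale_by_frequency(counts, value_range):
    """Linearly map each count onto value_range (min count -> low, max count -> high)."""
    low, high = value_range
    cmin, cmax = min(counts.values()), max(counts.values())
    if cmin == cmax:
        # All mutations are equally frequent, so give them the same (maximum) value
        return {key: float(high) for key in counts}
    span = cmax - cmin
    return {key: low + (count - cmin) * (high - low) / span
            for key, count in counts.items()}


# Residue 17 is mutated in two sequences and residue 4 in one (from the summary above)
mutation_counts = {17: 2, 4: 1}
sizes = scale_by_frequency(mutation_counts, (4, 7))
opacities = scale_by_frequency(mutation_counts, (0.8, 1))
print(sizes)
print(opacities)
```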

In [16]:
# Map the mutations on the visualization (scale increased)
my_protein.view_all_mutations(scale_range=(4,7))

INFO:ssbio.structure.structprop:Selection: ( :A ) and not hydrogen and 17
INFO:ssbio.structure.structprop:Selection: ( :A ) and not hydrogen and 4


#### `representative_structure.view_structure_and_highlight_residues`
View any residue on the representative chain (by the structure residue number).

<p><div class="alert alert-warning">**Warning:** Sometimes, the residue numbering of a structure doesn't match a mapped sequence. You can use the method in the [next section](#Mapping-sequence-residue-numbers-to-structure-residue-numbers) to easily map residue numbers between the representative sequence and structure. </div></p>

- `structure_resnums`
    - (int, list): Residue numbers in the structure

In [17]:
# View the structure and highlight residues 1, 2, and 3
my_protein.representative_structure.view_structure_and_highlight_residues(structure_resnums=[1,2,3])

INFO:ssbio.structure.structprop:Selection: ( :A ) and not hydrogen and ( 1 or 2 or 3 )
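
The selection strings in the log messages follow a simple pattern: chain, a hydrogen filter, then the residue numbers. The sketch below rebuilds that pattern from the logged output only; `build_selection` is a hypothetical helper and is not taken from the `ssbio` source.

```python
def build_selection(chain_id, resnums):
    """Build a selection string in the format shown in the log output above."""
    if isinstance(resnums, int):
        resnums = [resnums]
    if len(resnums) == 1:
        residue_part = str(resnums[0])
    else:
        residue_part = "( " + " or ".join(str(r) for r in resnums) + " )"
    return "( :{} ) and not hydrogen and {}".format(chain_id, residue_part)


print(build_selection("A", [1, 2, 3]))
print(build_selection("A", 17))
```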


## Mapping sequence residue numbers to structure residue numbers

### Methods

#### `representative_structure.map_repseq_resnums_to_structure_resnums`
<p><div class="alert alert-info">**Note:** This method has not been implemented for other chains or other stored structures yet.</div></p>

Map residue numbers in the representative sequence to the corresponding residue numbers in the structure file.

- `resnums`
    - (int, list): Residue numbers in the representative sequence

In [18]:
my_protein.representative_structure.map_repseq_resnums_to_structure_resnums([1,2,3,4])



{3: (' ', 3, ' '), 4: (' ', 4, ' ')}
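
In the output above, residues 1 and 2 have no entry, presumably because they have no counterpart in the structure. The values look like Biopython-style residue IDs of the form (hetero flag, residue number, insertion code). A minimal sketch for separating mapped from unmapped residues, using the output above as a literal dictionary:

```python
# Mapping copied from the output above
mapping = {3: (" ", 3, " "), 4: (" ", 4, " ")}
requested = [1, 2, 3, 4]

# Keep only the structure residue number (the middle element of each ID)
mapped = {resnum: mapping[resnum][1] for resnum in requested if resnum in mapping}
unmapped = [resnum for resnum in requested if resnum not in mapping]

print("Mapped:", mapped)
print("Not in structure:", unmapped)
```

Checking for missing keys before highlighting residues avoids a `KeyError` when a sequence position is absent from the structure.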

## Saving

In [19]:
import os.path as op
my_protein.save_json(op.join(my_protein.protein_dir, 'protein.json'))

INFO:ssbio.core.io:Saved <class 'ssbio.core.protein.Protein'> (id: SRR1753782_00918) to /tmp/SRR1753782_00918/protein.json
