# SeqProp - Protein Sequence Properties

This notebook gives an overview of the available **calculations for properties of a single protein sequence**.

<div class="alert alert-info">

**Input:** Amino acid sequence

</div>

<div class="alert alert-info">

**Output:** Amino acid sequence properties

</div>

## Imports

In [1]:
import sys
import logging
import os.path as op

In [2]:
# Import the SeqProp class
from ssbio.protein.sequence.seqprop import SeqProp

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level in `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. The levels below are listed from least to most verbose; `DEBUG` is the most verbose.

- `CRITICAL`
     - Only the most severe messages are shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Highly detailed information that produces a large volume of output
     
<div class="alert alert-warning">

**Warning:** `DEBUG` mode prints a large amount of information, especially if you have many genes. This may stall your notebook!

</div>

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:
# Send log messages to stderr with a timestamped format so they display in Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
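
To see how the chosen level filters messages, here is a minimal sketch using only the standard `logging` module. The logger name `seqprop_demo` and the in-memory stream are hypothetical and exist only for this demonstration; they are not part of ssbio.

```python
import io
import logging

# Hypothetical logger name and in-memory stream, used only for this demonstration
stream = io.StringIO()
demo_logger = logging.getLogger("seqprop_demo")
demo_logger.propagate = False  # keep demo messages out of the root logger
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
demo_logger.handlers = [demo_handler]

# At INFO level, DEBUG messages are dropped while INFO and above pass through
demo_logger.setLevel(logging.INFO)
demo_logger.debug("hidden at INFO level")
demo_logger.info("shown at INFO level")
demo_logger.warning("shown at INFO level and above")

captured = stream.getvalue()
print(captured)
```

Changing the `setLevel` argument to `logging.DEBUG` would let the first message through as well.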

## Initialization of the project

Set these two variables:

- `PROTEIN_ID`
    - Your protein ID
- `PROTEIN_SEQ`
    - Your protein sequence
    

In [6]:
# SET THE PROTEIN ID AND SEQUENCE HERE
PROTEIN_ID = 'YIAJ_ECOLI'
PROTEIN_SEQ = 'MGKEVMGKKENEMAQEKERPAGSQSLFRGLMLIEILSNYPNGCPLAHLSELAGLNKSTVHRLLQGLQSCGYVTTAPAAGSYRLTTKFIAVGQKALSSLNIIHIAAPHLEALNIATGETINFSSREDDHAILIYKLEPTTGMLRTRAYIGQHMPLYCSAMGKIYMAFGHPDYVKSYWESHQHEIQPLTRNTITELPAMFDELAHIRESGAAMDREENELGVSCIAVPVFDIHGRVPYAVSISLSTSRLKQVGEKNLLKPLRETAQAISNELGFTVRDDLGAIT'

In [7]:
# Create the SeqProp object
my_seq = SeqProp(id=PROTEIN_ID, seq=PROTEIN_SEQ)

In [8]:
# Write a temporary FASTA file for property calculations that require a FASTA file as input
import tempfile
ROOT_DIR = tempfile.gettempdir()

my_seq.write_fasta_file(outfile=op.join(ROOT_DIR, 'tmp.fasta'), force_rerun=True)
my_seq.sequence_path

'/tmp/tmp.fasta'

## Computing and storing protein properties

A `SeqProp` object is simply an extension of the Biopython `SeqRecord` object. Global properties which describe or summarize the entire protein sequence are stored in the `annotations` attribute, while local residue-specific properties are stored in the `letter_annotations` attribute. 
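
The distinction between the two storage locations can be sketched with plain dictionaries. This is only a stand-in for illustration, not the Biopython classes: the `-demo` suffix and all key names are hypothetical. The length check mirrors the requirement that each per-residue entry has exactly one value per residue.

```python
# Plain-Python stand-in for the two storage locations (not the Biopython classes)
seq = "MGKEVMGKKE"  # first ten residues of the example sequence

# Global properties: one value per key, tagged with a suffix naming the source
annotations = {
    "length-demo": len(seq),
    "lysine_fraction-demo": seq.count("K") / len(seq),
    "note-other": "unrelated entry",
}

# Local properties: one value per residue, so the length must match the sequence
letter_annotations = {
    "is_lysine-demo": [residue == "K" for residue in seq],
}
for key, values in letter_annotations.items():
    if len(values) != len(seq):
        raise ValueError(f"{key} must have one value per residue")

# Select properties by suffix, as the cells in this notebook do
demo_only = {k: v for k, v in annotations.items() if k.endswith("-demo")}
print(demo_only)
```

The suffix on each key records which tool produced the value, which is why the cells in this notebook filter `annotations` with `endswith` or `startswith`.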

### Basic global properties

In [9]:
# Global properties using the Biopython ProteinAnalysis class
my_seq.get_biopython_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-biop')}

{'amino_acids_percent-biop': {'A': 0.09219858156028368,
  'C': 0.014184397163120567,
  'D': 0.028368794326241134,
  'E': 0.07801418439716312,
  'F': 0.024822695035460994,
  'G': 0.07446808510638298,
  'H': 0.03900709219858156,
  'I': 0.07092198581560284,
  'K': 0.04609929078014184,
  'L': 0.1099290780141844,
  'M': 0.03546099290780142,
  'N': 0.03900709219858156,
  'P': 0.04609929078014184,
  'Q': 0.03546099290780142,
  'R': 0.04964539007092199,
  'S': 0.07446808510638298,
  'T': 0.06028368794326241,
  'V': 0.0425531914893617,
  'W': 0.0035460992907801418,
  'Y': 0.03546099290780142},
 'aromaticity-biop': 0.06382978723404256,
 'instability_index-biop': 46.34609929078015,
 'isoelectric_point-biop': 6.41558837890625,
 'molecular_weight-biop': 31066.304700000015,
 'monoisotopic-biop': False,
 'percent_helix_naive-biop': 0.2872340425531915,
 'percent_strand_naive-biop': 0.31560283687943264,
 'percent_turn_naive-biop': 0.23404255319148937}
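
In the output above, `aromaticity-biop` equals the sum of the F, W and Y fractions (0.0248 + 0.0035 + 0.0355 ≈ 0.0638). The following sketch shows how such composition values can be computed by hand. The helper names and the toy peptide are hypothetical, and this is not how ssbio or Biopython implement the calculation internally.

```python
from collections import Counter

def composition(seq):
    """Return the fraction of each amino acid present in seq."""
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in sorted(counts.items())}

def aromaticity(seq):
    """Return the combined fraction of F, W and Y residues."""
    return sum(seq.count(aa) for aa in "FWY") / len(seq)

toy_seq = "MFWYAAKK"  # hypothetical 8-residue peptide
toy_composition = composition(toy_seq)
toy_aromaticity = aromaticity(toy_seq)
print(toy_composition)
print(toy_aromaticity)
```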

In [10]:
# Global properties from the EMBOSS pepstats program
my_seq.get_emboss_pepstats()
{k:v for k,v in my_seq.annotations.items() if k.endswith('-pepstats')}

{'percent_acidic-pepstats': 0.10638,
 'percent_aliphatic-pepstats': 0.3156,
 'percent_aromatic-pepstats': 0.10284,
 'percent_basic-pepstats': 0.13475,
 'percent_charged-pepstats': 0.24112999999999998,
 'percent_non-polar-pepstats': 0.5496500000000001,
 'percent_polar-pepstats': 0.45035,
 'percent_small-pepstats': 0.47163,
 'percent_tiny-pepstats': 0.3156}
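
The fractions above are sums over residue classes. For example, `percent_acidic-pepstats` plus `percent_basic-pepstats` equals `percent_charged-pepstats` (0.10638 + 0.13475 = 0.24113). Note also that `percent_aromatic-pepstats` (0.10284) is larger than `aromaticity-biop` (0.06383); the difference matches the histidine fraction (0.03901), which suggests the two tools define "aromatic" differently. The class memberships in this sketch are assumptions chosen to be consistent with these numbers, and the function name is hypothetical.

```python
# Residue classes assumed for illustration, chosen to be consistent with the output above
RESIDUE_CLASSES = {
    "acidic": set("DE"),
    "basic": set("HKR"),
    "charged": set("DEHKR"),
    "aromatic": set("FHWY"),
}

def class_fractions(seq, classes=RESIDUE_CLASSES):
    """Return the fraction of residues in seq that fall into each class."""
    return {
        name: sum(residue in members for residue in seq) / len(seq)
        for name, members in classes.items()
    }

toy_fractions = class_fractions("DEHKRFWYAA")  # hypothetical 10-residue peptide
print(toy_fractions)
```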

In [11]:
# Aggregation propensity - the predicted number of aggregation-prone segments on an unfolded protein sequence
my_seq.get_aggregation_propensity(outdir=ROOT_DIR, email='nmih@ucsd.edu', password='ssbiotest', cutoff_v=5, cutoff_n=5, run_amylmuts=False)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-amylpred')}

{'aggprop-amylpred': 7}

In [12]:
# Kinetic folding rate - the predicted rate of folding for this protein sequence
secstruct_class = 'mixed'
my_seq.get_kinetic_folding_rate(secstruct=secstruct_class)
{k:v for k,v in my_seq.annotations.items() if k.endswith('-foldrate')}

{'kinetic_folding_rate_37.0_C-foldrate': '3.1'}

In [13]:
# Thermostability - prediction of the free energy of unfolding (dG) from the protein sequence
# Stores a (dG, Keq) tuple for each temperature
my_seq.get_thermostability(at_temp=32.0)
my_seq.get_thermostability(at_temp=37.0)
my_seq.get_thermostability(at_temp=42.0)
{k:v for k,v in my_seq.annotations.items() if k.startswith('thermostability_')}

{'thermostability_32.0_C-oobatake': (-485.4540664728014, 2.22678150661948),
 'thermostability_37.0_C-oobatake': (-2126.8775952298206, 31.527746910631482),
 'thermostability_42.0_C-oobatake': (-4205.694728563369, 825.0926295027567)}
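
The stored pairs are numerically consistent with the standard relation $K_{eq} = e^{-\Delta G / RT}$ when dG is read as cal/mol. The units are inferred from the numbers and are not stated in this notebook, so treat the sketch below as an assumption-laden check rather than a description of the ssbio implementation. The function name is hypothetical.

```python
import math

R_CAL = 1.9872  # gas constant in cal/(mol*K)

def keq_from_dg(dg_cal_per_mol, temp_celsius):
    """Return exp(-dG / (R*T)) for dG in cal/mol and T in degrees Celsius."""
    temp_kelvin = temp_celsius + 273.15
    return math.exp(-dg_cal_per_mol / (R_CAL * temp_kelvin))

# dG values copied from the output above
keq_32 = keq_from_dg(-485.4540664728014, 32.0)
keq_37 = keq_from_dg(-2126.8775952298206, 37.0)
print(keq_32, keq_37)
```

Under these assumptions the results should land close to the stored `Keq` values of about 2.23 and 31.5.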