# GEM-PRO - Calculating Protein Properties
This notebook shows how to **calculate protein properties** for a list of proteins: it first pulls information from UniProt, then uses the 3D structure files to calculate the desired properties.

<div class="alert alert-info">

**Input:** List of gene IDs

</div>

<div class="alert alert-info">

**Output:** Representative protein structures and properties associated with them

</div>

## Imports

In [1]:
import sys
import logging

In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level in `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. `DEBUG` is the most verbose level.

- `CRITICAL`
     - Only really important messages shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Very detailed information; produces a large amount of output
     
<div class="alert alert-warning">

**Warning:** 
`DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
</div>

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:
# Send log messages to stderr with a timestamped format (for Jupyter notebooks)
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
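
To make the effect of the level concrete, here is a minimal sketch using only the standard `logging` module. The logger name `gempro_logging_demo` and the helper `emitted_levels` are made up for this illustration and are not part of ssbio; the messages are placeholders.

```python
import io
import logging

def emitted_levels(threshold):
    """Return the level names that pass a logger set to `threshold`."""
    stream = io.StringIO()
    demo_logger = logging.getLogger("gempro_logging_demo")  # hypothetical name
    demo_logger.propagate = False  # keep the demo out of the root logger's output
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s"))
    demo_logger.handlers = [handler]
    demo_logger.setLevel(threshold)

    # One placeholder message per level, from least to most severe
    demo_logger.debug("very detailed information")
    demo_logger.info("number of structures mapped per gene")
    demo_logger.warning("warning that does not stop the pipeline")
    demo_logger.error("major error")
    demo_logger.critical("really important message")
    return stream.getvalue().split()

print(emitted_levels(logging.INFO))   # ['INFO', 'WARNING', 'ERROR', 'CRITICAL']
print(emitted_levels(logging.ERROR))  # ['ERROR', 'CRITICAL']
```

A message is shown only if its level is at or above the level set on the logger, which is why `DEBUG` prints everything and `CRITICAL` prints almost nothing.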

## Initialization of the project

Set these three things:

- `ROOT_DIR`
    - The directory where a folder named after your `PROJECT` will be created
- `PROJECT`
    - Your project name
- `LIST_OF_GENES`
    - Your list of gene IDs
    
A directory will be created in `ROOT_DIR` with your `PROJECT` name. The folders are organized like so:
```
    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files
                
```

<div class="alert alert-info">**Note:** Methods for protein complexes and metabolites are still in development.</div>
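
As a rough illustration of the layout described in the tree, the sketch below creates the same kind of folder structure with the standard library. It is not how ssbio builds the project internally; the helper name `make_project_dirs` is an assumption made for this example, and only the top-level and per-gene folders are created.

```python
import os
import tempfile

def make_project_dirs(root_dir, project, gene_ids):
    """Create a PROJECT folder under root_dir with per-gene subfolders."""
    base = os.path.join(root_dir, project)
    # Top-level folders shown in the tree
    for sub in ("data", "model", "genes", "reactions", "metabolites"):
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    # Each gene gets protein/sequences and protein/structures
    for gene_id in gene_ids:
        for sub in ("sequences", "structures"):
            os.makedirs(os.path.join(base, "genes", gene_id, "protein", sub),
                        exist_ok=True)
    return base

demo_root = tempfile.mkdtemp()  # throwaway directory for the demo
base_dir = make_project_dirs(demo_root, "genes_GP", ["b1276", "b0118"])
for dirpath, dirnames, filenames in os.walk(base_dir):
    print(os.path.relpath(dirpath, demo_root))
```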

In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_GP'
LIST_OF_GENES = ['b1276', 'b0118']

In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_list=LIST_OF_GENES, pdb_file_type='pdb')

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP: GEM-PRO project location
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: 2: number of genes


## Mapping gene ID --> sequence

First, we need to map these IDs to their protein sequences. Two ID mapping services are provided for this: **KEGG** and **UniProt**. The end goal is to map a UniProt ID to each gene ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

<p><div class="alert alert-info">**Note:** You only need to map gene IDs using one service. However, you can run both if some genes fail to map in one service but do map in the other!</div></p>
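
The idea of running both services can be sketched in plain Python as a lookup with a fallback. This is an illustration only, not ssbio code: the function `combine_mappings`, the two result dictionaries, and the gene ID `b9999` are hypothetical, and the scenario where one service misses `b1276` is invented to show the fallback.

```python
def combine_mappings(gene_ids, primary, fallback):
    """Map each gene ID to a UniProt ID, trying `primary` before `fallback`."""
    mapped, missing = {}, []
    for gene_id in gene_ids:
        uniprot_id = primary.get(gene_id) or fallback.get(gene_id)
        if uniprot_id:
            mapped[gene_id] = uniprot_id
        else:
            missing.append(gene_id)  # neither service had an entry
    return mapped, missing

# Hypothetical results: pretend the first service missed b1276
service_a = {"b0118": "P36683"}
service_b = {"b0118": "P36683", "b1276": "P25516"}

mapped, missing = combine_mappings(["b1276", "b0118", "b9999"], service_a, service_b)
print(mapped)   # {'b1276': 'P25516', 'b0118': 'P36683'}
print(missing)  # ['b9999']
```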

### Methods

In [8]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()

[2017-04-29 10:48] [root] INFO: getUserAgent: Begin
[2017-04-29 10:48] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-04-29 10:48] [root] INFO: getUserAgent: End
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: 2/2: number of genes mapped to UniProt
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.



Missing UniProt mapping:  []


| gene | uniprot | reviewed | num_pdbs | seq_len | description | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|
| b0118 | P36683 | False | 0 | 865 | Aconitate hydratase B | P36683.fasta | P36683.xml |
| b1276 | P25516 | False | 0 | 891 | Aconitate hydratase A | P25516.fasta | P25516.xml |


In [9]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative sequence
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.



Missing a representative sequence:  []


| gene | uniprot | num_pdbs | seq_len | sequence_file | metadata_file |
|---|---|---|---|---|---|
| b0118 | P36683 | 0 | 865 | P36683.fasta | P36683.xml |
| b1276 | P25516 | 0 | 891 | P25516.fasta | P25516.xml |


## Mapping representative sequence --> structure

There are four ways to map a sequence to a structure:

1. Use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence against the PDB
3. Make homology models
4. Map to existing homology models

Option #1 only works if the representative sequence has a mapped UniProt ID. If it does not, you'll have to BLAST your sequence against the PDB or make a homology model. You can also run both the UniProt mapping and BLAST for maximum coverage.

### Methods

In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-04-29 10:48] [root] INFO: getUserAgent: Begin
[2017-04-29 10:48] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-04-29 10:48] [root] INFO: getUserAgent: End
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: 1/2: number of genes with at least one experimental structure
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.





| gene | pdb_id | pdb_chain_id | uniprot | experimental_method | resolution | coverage | start | end | unp_start | unp_end | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| b0118 | 1l5j | A | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 1 |
| b0118 | 1l5j | B | P36683 | X-ray diffraction | 2.4 | 1 | 1 | 865 | 1 | 865 | 2 |


In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: 0: number of genes with additional structures added from BLAST
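
The two cutoffs passed to `blast_seqs_to_pdb` can be read as a filter over BLAST hits. The sketch below shows that reading in plain Python. It is an assumption about the semantics, not ssbio's implementation: the hit dictionaries, their keys, and the hit labels are made up, and whether the real comparison is inclusive is not stated in this notebook.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.9, evalue_cutoff=0.00001):
    """Keep hits with high enough identity and a small enough E-value."""
    return [
        hit for hit in hits
        if hit["identity"] >= seq_ident_cutoff   # fraction of identical residues
        and hit["evalue"] <= evalue_cutoff       # lower E-value = more significant
    ]

# Made-up hits for illustration only
hits = [
    {"hit": "hit_1", "identity": 0.98, "evalue": 1e-120},
    {"hit": "hit_2", "identity": 0.45, "evalue": 1e-30},   # identity too low
    {"hit": "hit_3", "identity": 0.95, "evalue": 0.01},    # E-value too high
]

kept = filter_blast_hits(hits)
print([hit["hit"] for hit in kept])  # ['hit_1']
```

With a strict identity cutoff such as `.9`, it is unsurprising for a run to add no new structures, as in the log output here.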





In [12]:
import pandas as pd
import os.path as op

In [13]:
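# Create a manual mapping dictionary for E. coli I-TASSER models
# NOTE: the paths below are specific to the original author's machine; point them at your own files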
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/zhang/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/zhang_data/160804-ZHANG_INFO.csv')
tmp = homology_models_df[['zhang_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['zhang_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                        'file_type':'pdb'}}
    
my_gempro.get_manual_homology_models(homology_model_dict)

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.





In [14]:
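# Create a manual mapping dictionary for E. coli SUNPRO models
# NOTE: the paths below are specific to the original author's machine; point them at your own files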
homology_models = '/home/nathan/projects_archive/homology_models/ECOLI/sunpro/'
homology_models_df = pd.read_csv('/home/nathan/projects_archive/homology_models/ECOLI/sunpro_data/160609-SUNPRO_INFO.csv')
tmp = homology_models_df[['sunpro_id','model_file','m_gene']].drop_duplicates()
tmp = tmp[pd.notnull(tmp.m_gene)]

homology_model_dict = {}

for i,r in tmp.iterrows():
    homology_model_dict[r['m_gene']] = {r['sunpro_id']: {'model_file':op.join(homology_models, r['model_file']),
                                                         'file_type':'pdb'}}
    
my_gempro.get_manual_homology_models(homology_model_dict)

[2017-04-29 10:48] [ssbio.pipeline.gempro] INFO: Updated homology model information for 2 genes.
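
The dictionary passed to `get_manual_homology_models` in the two cells above has the shape `{gene_id: {model_id: {'model_file': ..., 'file_type': ...}}}`. Note that the loops assign a fresh inner dictionary per row, so a gene with several rows keeps only its last model. The sketch below builds the same shape from plain rows while keeping every model per gene. It assumes the nested dictionary may hold more than one model ID per gene, which this notebook does not confirm; the helper name, the generic `model_id` column, the row values, and the directory are all hypothetical.

```python
import os.path as op

def build_homology_model_dict(rows, models_dir, file_type="pdb"):
    """Group model files by gene: {gene: {model_id: {model_file, file_type}}}."""
    model_dict = {}
    for row in rows:
        if not row.get("m_gene"):
            continue  # skip rows with no gene assigned (like the notnull filter)
        gene_models = model_dict.setdefault(row["m_gene"], {})
        gene_models[row["model_id"]] = {
            "model_file": op.join(models_dir, row["model_file"]),
            "file_type": file_type,
        }
    return model_dict

# Made-up rows standing in for the CSV contents
rows = [
    {"model_id": "model_A", "model_file": "model_A.pdb", "m_gene": "b1276"},
    {"model_id": "model_B", "model_file": "model_B.pdb", "m_gene": "b1276"},
    {"model_id": "model_C", "model_file": "model_C.pdb", "m_gene": None},
]

result = build_homology_model_dict(rows, "path_to_models")
print(sorted(result["b1276"]))  # ['model_A', 'model_B']
```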





## Downloading and ranking structures

### Methods

<div class="alert alert-warning">

**Warning:** 
Downloading all PDBs takes a while, since they are also parsed for metadata. You can skip this step and just set representative structures below if you want to minimize the number of PDBs downloaded.

</div>

In [15]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: Saved 1 structures total





| gene | pdb_id | pdb_title | description | experimental_method | mapped_chains | resolution | chemicals | date | taxonomy_name | structure_file |
|---|---|---|---|---|---|---|---|---|---|---|
| b0118 | 1l5j | CRYSTAL STRUCTURE OF E. COLI ACONITASE B. | Aconitate hydratase 2 (E.C.4.2.1.3) | X-ray diffraction | A;B | 2.4 | TRA;F3S | 2002-06-12;2003-04-01;2009-02-24 | Escherichia coli | 1l5j.pdb |


In [16]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with a representative structure
[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.





| gene | id | is_experimental | reference_seq_top_coverage | structure_file |
|---|---|---|---|---|
| b0118 | 1l5j-A | True | 99.7 | 1l5j-A_clean.pdb |
| b1276 | ACON1_ECOLI-X | False | 100.0 | ACON1_ECOLI_model1_clean-X_clean.pdb |
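
The representative structures output reports whether each structure is experimental and how much of the reference sequence it covers. One plausible way to pick among several candidates using those two columns is sketched below. This is a guess for illustration and should not be read as ssbio's actual selection logic, which is not described in this notebook; the function name and the candidate entries are made up.

```python
def pick_representative(candidates):
    """Prefer experimental structures, then the highest sequence coverage."""
    # Tuples compare element by element, and True sorts above False
    return max(candidates, key=lambda c: (c["is_experimental"], c["coverage"]))

# Made-up candidates for a single gene
candidates = [
    {"id": "model_X", "is_experimental": False, "coverage": 100.0},
    {"id": "pdb_A", "is_experimental": True, "coverage": 99.7},
    {"id": "pdb_B", "is_experimental": True, "coverage": 80.0},
]

best = pick_representative(candidates)
print(best["id"])  # pdb_A
```

Under this rule a homology model is chosen only when no experimental structure is available, which matches the pattern in the output for `b1276`.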


### Computing sequence and structure properties

In [17]:
# Requires EMBOSS "pepstats" program
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
# Install using:
# sudo apt-get install emboss
my_gempro.get_sequence_properties()




In [18]:
# Requires a SCRATCH installation; replace path_to_scratch with the path to your own copy of the script
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_scratch_predictions(path_to_scratch='/home/nathan/software/SCRATCH-1D_1.1/bin/run_SCRATCH-1D_predictors.sh', results_dir=my_gempro.data_dir)

[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: /tmp/genes_GP/data/genes_GP_cds.faa: wrote all representative sequences to file
[2017-04-29 10:49] [ssbio.pipeline.gempro] INFO: 2/2: number of genes with SCRATCH predictions loaded





In [19]:
my_gempro.get_disulfide_bridges()




In [20]:
# Requires DSSP installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_dssp_annotations()




In [21]:
# Requires MSMS installation
# See the ssbio wiki for more information: https://github.com/SBRG/ssbio/wiki/Software-Installations
my_gempro.get_msms_annotations()




### Extracting residue-level properties for a list of residue numbers

In [22]:
from Bio.SeqFeature import SeqFeature, FeatureLocation

##### Looking at one protein

In [23]:
my_protein = my_gempro.genes.b0118.protein

In [24]:
# Here are the features stored for the sequence, parsed from UniProt
my_protein.representative_sequence.seq_record.features

[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(865)), type='chain', id='PRO_0000076675'),
 SeqFeature(FeatureLocation(ExactPosition(243), ExactPosition(246)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(413), ExactPosition(416)), type='region of interest'),
 SeqFeature(FeatureLocation(ExactPosition(709), ExactPosition(710)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(768), ExactPosition(769)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(771), ExactPosition(772)), type='metal ion-binding site'),
 SeqFeature(FeatureLocation(ExactPosition(190), ExactPosition(191)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(497), ExactPosition(498)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(790), ExactPosition(791)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPosition(795), ExactPosition(796)), type='binding site'),
 SeqFeature(FeatureLocation(ExactPos

In [28]:
# Gathering properties for the metal binding sites

# Saving the structure residue numbering for visualization
metal_binding_structure_residues = []

for f in my_protein.representative_sequence.seq_record.features:
    if 'metal' in f.type.lower():
        print(f)
        
        # Get sequence properties
        sr = f.extract(my_protein.representative_sequence.seq_record)
        print('**Residue-level properties calculated or predicted from sequence:')
        print(sr.letter_annotations)
        
        # Get structure properties
        new_f = SeqFeature(FeatureLocation(int(sr.letter_annotations['repchain_resnums'][0]-1), int(sr.letter_annotations['repchain_resnums'][-1])))
        sr_st = new_f.extract(my_protein.representative_structure.representative_chain.seq_record)
        print('**Residue-level properties calculated from structure:')
        print(sr_st.letter_annotations)
        print('--------------------------------')
        
        # Get structure residue numbers
        res = my_protein.representative_structure.map_repseq_resnums_to_structure_resnums(my_protein.representative_sequence,
                                                                                          resnums=list(range(f.location.start+1, f.location.end+1)))
        metal_binding_structure_residues.append(res[f.location.start+1][1])

type: metal ion-binding site
location: [709:710]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
{'SS-sspro': 'C', 'SS-sspro8': 'T', 'repchain_resnums': [710.0], 'RSA-accpro': '-', 'RSA-accpro20': [10]}
**Residue-level properties calculated from structure:
{'RES_DEPTH-msms': [10.009108769044623], 'RSA-dssp': [0.11851851851851852], 'PHI-dssp': [-67.099999999999994], 'PSI-dssp': [-7.2000000000000002], 'SS-dssp': ['T'], 'structure_resnums': [(' ', 710, ' ')], 'CA_DEPTH-msms': [10.148959936792412], 'ASA-dssp': [16.0]}
--------------------------------
type: metal ion-binding site
location: [768:769]
qualifiers:
    Key: description, Value: Iron-sulfur (4Fe-4S)
    Key: evidence, Value: 5
    Key: type, Value: metal ion-binding site

**Residue-level properties calculated or predicted from sequence:
{'SS-sspro': 'C', 'SS-sspro8': 'C', 'repch

### Visualizing residues

In [29]:
metal_binding_structure_residues

[710, 769, 772]
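
These residue numbers follow from the feature locations printed earlier: Biopython locations are 0-based and end-exclusive, while residue numbers are 1-based, which is why the loop uses `f.location.start+1`. A minimal sketch of that conversion in plain Python (the helper name `location_to_resnums` is made up for this example):

```python
def location_to_resnums(start, end):
    """Convert a 0-based, end-exclusive location to 1-based residue numbers."""
    return list(range(start + 1, end + 1))

# The three metal ion-binding site locations listed in the sequence features
metal_site_locations = [(709, 710), (768, 769), (771, 772)]

resnums = [
    resnum
    for start, end in metal_site_locations
    for resnum in location_to_resnums(start, end)
]
print(resnums)  # [710, 769, 772]
```

This gives sequence residue numbers. In this example they coincide with the structure numbering, but that is not guaranteed in general, so the notebook maps them with `map_repseq_resnums_to_structure_resnums` before visualizing.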

In [30]:
my_protein.representative_structure.view_structure_and_highlight_residues(metal_binding_structure_residues)

[2017-04-29 10:50] [ssbio.protein.structure.structprop] INFO: Selection: ( :A ) and not hydrogen and ( 769 or 772 or 710 )
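
The selection string in the log combines a chain, a hydrogen filter, and the residue numbers. The sketch below rebuilds a string in the same format from a chain ID and a residue list. It only mimics the logged format and is not ssbio's code; the function name `build_selection` is an assumption for this example.

```python
def build_selection(chain_id, resnums):
    """Build a selection string in the format shown in the log output."""
    residues = " or ".join(str(resnum) for resnum in resnums)
    return "( :{} ) and not hydrogen and ( {} )".format(chain_id, residues)

selection = build_selection("A", [710, 769, 772])
print(selection)  # ( :A ) and not hydrogen and ( 710 or 769 or 772 )
```

The residues appear in a different order in the log, which does not change the selection because the terms are joined with `or`.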
