# GEM-PRO - Genes & Sequences

This notebook gives an example of how to run the GEM-PRO pipeline with a **dictionary of gene IDs and their protein sequences**.
<p>
<div class="alert alert-info">
**Input:** Dictionary of gene IDs and protein sequences
</div>
<div class="alert alert-info">
**Output:** GEM-PRO model
</div>
</p>

## Imports

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [2]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level in `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. `DEBUG` is the most verbose level.

- `CRITICAL`
     - Only really important messages shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Very detailed information; prints a large amount of output
     
<div class="alert alert-warning">**Warning:** `DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!</div>

In [3]:
# Create logger
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #
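
If setting the root logger to `INFO` also makes other libraries noisy, one option is to keep the root logger at `WARNING` and lower only the pipeline's logger. This is a minimal sketch using the standard `logging` module. It assumes the pipeline logs under the `ssbio` namespace (the log lines in this notebook are prefixed with `ssbio.pipeline.gempro`) and that a handler is already attached to the root logger, as is the case when log lines appear in the notebook.

```python
import logging

# Keep everything else quiet by default
logging.getLogger().setLevel(logging.WARNING)

# Assumption: ssbio modules log under the "ssbio" namespace
ssbio_logger = logging.getLogger('ssbio')
ssbio_logger.setLevel(logging.INFO)

# Child loggers without their own level inherit from the nearest configured ancestor
gempro_logger = logging.getLogger('ssbio.pipeline.gempro')
print(logging.getLevelName(gempro_logger.getEffectiveLevel()))
```
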

## Initialization of the project

Set these three things:

- `ROOT_DIR`
    - The directory where a folder named after your `PROJECT` will be created
- `PROJECT`
    - Your project name
- `GENES_AND_SEQUENCES`
    - Your dictionary of gene IDs and their sequence strings
    
A directory will be created in `ROOT_DIR` with your `PROJECT` name. The folders are organized like so:
```
    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files
                
```

<div class="alert alert-info">**Note:** Methods for protein complexes and metabolites are still in development.</div>
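
The layout above can be turned into paths with `pathlib`. The sketch below is only an illustration of the tree as drawn: `gene_subdirs` is a hypothetical helper, not part of ssbio, and the folder names actually written to disk may differ between versions, so check the paths reported by the pipeline itself.

```python
from pathlib import Path
import tempfile

def gene_subdirs(root_dir, project, gene_id):
    """Hypothetical helper: build the per-gene folders shown in the tree above."""
    protein_dir = Path(root_dir) / project / 'genes' / gene_id / 'protein'
    return {
        'sequences': protein_dir / 'sequences',
        'structures': protein_dir / 'structures',
    }

dirs = gene_subdirs(tempfile.gettempdir(), 'genes_and_sequences_GP', 'b0870')
for name, path in dirs.items():
    # exists() is False until the project has actually been created
    print(name, path, path.exists())
```
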

In [4]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'genes_and_sequences_GP'
GENES_AND_SEQUENCES = {'b0870': 'MIDLRSDTVTRPSRAMLEAMMAAPVGDDVYGDDPTVNALQDYAAELSGKEAAIFLPTGTQANLVALLSHCERGEEYIVGQAAHNYLFEAGGAAVLGSIQPQPIDAAADGTLPLDKVAMKIKPDDIHFARTKLLSLENTHNGKVLPREYLKEAWEFTRERNLALHVDGARIFNAVVAYGCELKEITQYCDSFTICLSKGLGTPVGSLLVGNRDYIKRAIRWRKMTGGGMRQSGILAAAGIYALKNNVARLQEDHDNAAWMAEQLREAGADVMRQDTNMLFVRVGEENAAALGEYMKARNVLINASPIVRLVTHLDVSREQLAEVAAHWRAFLAR',
                       'b3041': 'MNQTLLSSFGTPFERVENALAALREGRGVMVLDDEDRENEGDMIFPAETMTVEQMALTIRHGSGIVCLCITEDRRKQLDLPMMVENNTSAYGTGFTVTIEAAEGVTTGVSAADRITTVRAAIADGAKPSDLNRPGHVFPLRAQAGGVLTRGGHTEATIDLMTLAGFKPAGVLCELTNDDGTMARAPECIEFANKHNMALVTIEDLVAYRQAHERKAS'}

In [5]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES)

INFO:ssbio.pipeline.gempro:/tmp/genes_and_sequences_GP: GEM-PRO project location
INFO:ssbio.pipeline.gempro:Loaded in 2 sequences
INFO:ssbio.pipeline.gempro:2: number of genes
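
Before creating a project it can help to check that each sequence string contains only standard one-letter amino acid codes, since stray whitespace or stop symbols are easy to paste in by accident. This is a plain-Python sketch: `find_invalid_residues` and the example dictionary are hypothetical and not part of ssbio, and whether the pipeline itself rejects such characters is not covered here.

```python
# The 20 standard amino acids, one-letter codes
STANDARD_AA = set('ACDEFGHIKLMNPQRSTVWY')

def find_invalid_residues(genes_and_sequences):
    """Hypothetical helper: map gene ID -> sorted list of unexpected characters."""
    problems = {}
    for gene_id, seq in genes_and_sequences.items():
        unexpected = sorted(set(seq.upper()) - STANDARD_AA)
        if unexpected:
            problems[gene_id] = unexpected
    return problems

example = {
    'gene_ok': 'MIDLRSDTVTRPSRAMLEAM',  # first 20 residues of b0870
    'gene_bad': 'MNQTLLSS FGTPX*',      # contains a space, an X and a *
}
print(find_invalid_residues(example))
```

An empty dictionary means every sequence passed the check.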


## Mapping sequence -> structure

Since the sequences have been provided, we just need to BLAST them against the PDB.

<p><div class="alert alert-info">**Note:** These methods do not download any 3D structure files.</div></p>

### Methods

#### `blast_seqs_to_pdb`
This will BLAST the representative sequence against the entire PDB, and return significant hits. XML files of the BLAST results are saved in the respective sequence folders for a protein.

<p><div class="alert alert-info">**Warning:** A PDB BLAST may return hits in other organisms.</div></p>

- `seq_ident_cutoff`
    - Default: `0`
    - Minimum sequence identity for a hit to be kept, as a fraction from 0 to 1
- `evalue`
    - Default: `0.0001`
    - Significance of BLAST results
- `all_genes`
    - Default: `False`
    - Set to `True` if you want all genes and their sequences BLASTed
    - Set to `False` if you only want to BLAST sequences that did not have any PDBs mapped to them already
- `display_link`
    - Default: `False`
    - Set to `True` if you want a clickable HTML link to be printed
- `force_rerun`
    - Default: `False`
    - Set to `True` if you want to ignore any existing XML results and run the BLAST again

- What's saved?
    - Protein structures
    ```python
    my_gempro.genes.get_by_id('b0870').protein.structures
    ```
    - DataFrames
    ```python
    my_gempro.df_pdb_blast
    ```

In [6]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(seq_ident_cutoff=.80, evalue=0.00001)
my_gempro.df_pdb_blast.head()

INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.
INFO:ssbio.pipeline.gempro:0: number of genes with additional structures added from BLAST





| gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|
| b3041 | 1iez | A | 1060.0 | 1.44231e-115 | 1.0 | 1.0 | 217 | 217 |
| b3041 | 1g58 | A | 1050.0 | 2.01433e-114 | 0.995392 | 0.995392 | 216 | 216 |
| b3041 | 1g58 | B | 1050.0 | 2.01433e-114 | 0.995392 | 0.995392 | 216 | 216 |
| b3041 | 1g57 | A | 1050.0 | 2.01433e-114 | 0.995392 | 0.995392 | 216 | 216 |
| b3041 | 1g57 | B | 1050.0 | 2.01433e-114 | 0.995392 | 0.995392 | 216 | 216 |
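
To make the meaning of `seq_ident_cutoff` and `evalue` concrete, the sketch below applies both thresholds to the five hits shown above using plain Python. It is an illustration only: `filter_hits` is a hypothetical function, and whether ssbio uses inclusive or exclusive comparisons internally is an assumption.

```python
# Rows copied from the df_pdb_blast output above (subset of columns)
hits = [
    {'pdb_id': '1iez', 'chain': 'A', 'evalue': 1.44231e-115, 'ident': 1.0},
    {'pdb_id': '1g58', 'chain': 'A', 'evalue': 2.01433e-114, 'ident': 0.995392},
    {'pdb_id': '1g58', 'chain': 'B', 'evalue': 2.01433e-114, 'ident': 0.995392},
    {'pdb_id': '1g57', 'chain': 'A', 'evalue': 2.01433e-114, 'ident': 0.995392},
    {'pdb_id': '1g57', 'chain': 'B', 'evalue': 2.01433e-114, 'ident': 0.995392},
]

def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Illustrative filter (assumed inclusive bounds); not ssbio's implementation."""
    return [h for h in hits
            if h['ident'] >= seq_ident_cutoff and h['evalue'] <= evalue]

# Same thresholds as the call above: every listed hit passes
print(len(filter_hits(hits, seq_ident_cutoff=0.80, evalue=0.00001)))

# Requiring 100% identity keeps only the exact match
print([h['pdb_id'] for h in filter_hits(hits, seq_ident_cutoff=1.0)])
```
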


In [7]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.structures

[<PDBProp 3wlx at 0x7f4d568e0ef0>,
 <PDBProp 4rjy at 0x7f4d568e07f0>,
 <PDBProp 4lnm at 0x7f4d57a59ac8>,
 <PDBProp 4lnl at 0x7f4d56839128>,
 <PDBProp 4lnj at 0x7f4d56839240>]

## Downloading and ranking structures

### Methods

#### `pdb_downloader_and_metadata`
Download **all** structures per protein. This also adds metadata to each PDB object in the list of structures.

<p><div class="alert alert-warning">**Warning:** Skip this step if you do not need every PDB structure. If you only want one structure per protein, set representative structures below instead.</div></p>

- `outdir`
    - Default: `None`
    - Set this to a custom location if you want to save PDB files outside the GEM-PRO project folder
- `pdb_file_type`
    - Default: `'cif'` (set in GEMPRO project initialization, but can be changed here)
    - `'pdb'`, `'pdb.gz'`, `'mmcif'`, `'cif'`, `'cif.gz'`, `'xml.gz'`, `'mmtf'`, `'mmtf.gz'` - File type for files downloaded from the PDB.
- `force_rerun`
    - Default: `False`
    - Set to `True` if you want to re-download PDB files.
- What's saved?
    - Additional metadata per structure
    ```python
    my_gempro.genes.get_by_id('b0870').protein.structures
    ```
    - DataFrames
    ```python
    my_gempro.df_pdb_metadata
    ```

In [8]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)

INFO:ssbio.pipeline.gempro:Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute.
INFO:ssbio.pipeline.gempro:Saved 11 structures total





| gene | pdb_id | pdb_title | description | experimental_method | resolution | chemicals | date | taxonomy_name | structure_file |
|---|---|---|---|---|---|---|---|---|---|
| b3041 | 1iez | Solution Structure of 3,4-Dihydroxy-2-Butanone... | 3,4-Dihydroxy-2-Butanone 4-Phosphate Synthase ... | SOLUTION NMR | | | 2001-11-07;2003-04-01;2009-02-24;2011-07-13 | Escherichia coli | 1iez.cif |
| b3041 | 1g58 | CRYSTAL STRUCTURE OF 3,4-DIHYDROXY-2-BUTANONE ... | 3,4-DIHYDROXY-2-BUTANONE 4-PHOSPHATE SYNTHASE | X-RAY DIFFRACTION | 1.55 | AU | 2001-04-30;2003-04-01;2009-02-24 | Escherichia coli | 1g58.cif |


#### `set_representative_structure`
Rank available structures, run QC/QA, download and clean the final structure.

<p><div class="alert alert-info">**Note:** PDBs don't need to be downloaded before running this step. This is useful to limit the number of structures downloaded from the PDB.</div></p>

- `pdb_file_type`
    - Default: `'cif'` (set in GEMPRO project initialization, but can be changed here)
    - `'pdb'`, `'pdb.gz'`, `'mmcif'`, `'cif'`, `'cif.gz'`, `'xml.gz'`, `'mmtf'`, `'mmtf.gz'` - File type for files downloaded from the PDB.
- `engine`
    - Default: `'needle'`
    - Set to `'biopython'` if you want to utilize Biopython's built-in pairwise alignment algorithm.
- `always_use_homology`
    - Default: `False`
    - Set to `True` if you always want to use homology models.
- `seq_ident_cutoff`
    - Default: `0.5`
    - QC/QA: sets the minimum sequence identity a structure has to have to be selected as representative.
- `allow_missing_on_termini`
    - Default: `0.2`
    - QC/QA: Fraction (0 to 1) of the total length of the reference sequence which will be ignored when checking for modifications (mutations, deletions, insertions, or unresolved residues). Half of this fraction is ignored at each terminus. Example: if `0.1`, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- `allow_mutants`
    - Default: `True`
    - QC/QA: set to `True` if point mutations within the structure should be allowed.
- `allow_deletions`
    - Default: `False`
    - QC/QA: set to `True` if deletions within the structure should be allowed.
- `allow_insertions`
    - Default: `False`
    - QC/QA: set to `True` if insertions within the structure should be allowed.
- `allow_unresolved`
    - Default: `True`
    - QC/QA: set to `True` if unresolved regions within the structure should be allowed.
- `force_rerun`
    - Default: `False`
    - Set to `True` if pairwise alignments and structure cleaning should be rerun even if files exist.
- What's saved?
    - Representative protein structures
    ```python
    my_gempro.genes.get_by_id('b0870').protein.representative_structure
    ```
    - DataFrames
    ```python
    my_gempro.df_representative_structures
    ```
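
The arithmetic behind `allow_missing_on_termini` can be sketched as follows. `checked_window` is a hypothetical helper written for this illustration, not an ssbio function; it assumes the ignored fraction is split evenly between the two termini and rounded down, which reproduces the 100 AA example above but may not match ssbio's exact boundary handling.

```python
def checked_window(seq_length, allow_missing_on_termini=0.2):
    """Hypothetical helper: return (start, end) of the region that is checked.

    Half of the ignored fraction is taken from each terminus.
    """
    ignored_per_end = int(seq_length * allow_missing_on_termini / 2)
    return ignored_per_end, seq_length - ignored_per_end

# The example from the parameter description: 100 AA with a fraction of 0.1
print(checked_window(100, 0.1))
```
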

In [9]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

INFO:ssbio.pipeline.gempro:Created representative structures dataframe. See the "df_representative_structures" attribute.
INFO:ssbio.pipeline.gempro:2/2: genes with a representative structure





| gene | id | is_experimental | reference_seq | reference_seq_top_coverage |
|---|---|---|---|---|
| b3041 | 1iez-A | True | b3041 | 100.0 |
| b0870 | 3wlx-A | True | b0870 | 99.4 |


In [10]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('b0870').protein.representative_structure
my_gempro.genes.get_by_id('b0870').protein.representative_structure.get_dict()

<StructProp 3wlx-A at 0x7f4d548bf1d0>

{'_structure_dir': '/tmp/genes_and_sequences_GP/genes/b0870/b0870_protein/structures',
 'chains': [<ChainProp A at 0x7f4d55cd24a8>],
 'date': '2014-12-17',
 'description': 'Low specificity L-threonine aldolase (E.C.4.1.2.48)',
 'file_type': 'pdb',
 'id': '3wlx-A',
 'is_experimental': True,
 'mapped_chains': ['A'],
 'original_pdb_id': '3wlx',
 'reference_seq': <SeqProp b0870 at 0x7f4d55cd2dd8>,
 'reference_seq_top_coverage': 99.4,
 'representative_chain': <ChainProp A at 0x7f4d55cd2cf8>,
 'resolution': 2.51,
 'structure_file': '3wlx-A_clean.pdb',
 'taxonomy_name': 'Escherichia coli'}