# README

This notebook gives an example of how to run the GEM-PRO pipeline on a **dictionary of gene IDs and their protein sequences** (shown in Python 2; the steps are the same in Python 3).

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, update the repository first (`git pull`) before troubleshooting further

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [2]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
logger.setLevel(logging.INFO)
#######################################################
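
The level set on the root logger is inherited by module loggers that have no level of their own, which is why the `INFO:ssbio.pipeline.gempro:...` messages appear in the outputs of this notebook. The following is a minimal standard-library sketch of that behavior. The logger name is taken from the log output in this notebook, and it is an assumption that ssbio does not set its own level on that logger.

```python
import logging

root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

# A child logger without an explicit level falls back to its nearest ancestor's level.
pipeline_logger = logging.getLogger("ssbio.pipeline.gempro")
effective = pipeline_logger.getEffectiveLevel()
print(logging.getLevelName(effective))

# Raising the root threshold to WARNING hides INFO progress messages.
root_logger.setLevel(logging.WARNING)
info_enabled_when_quiet = pipeline_logger.isEnabledFor(logging.INFO)
print(info_enabled_when_quiet)

# Restore INFO so the rest of the notebook still shows progress messages.
root_logger.setLevel(logging.INFO)
```

Use `logging.DEBUG` instead of `logging.INFO` for more verbose output, or `logging.WARNING` to see only problems.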

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES_AND_SEQUENCES
    - A dictionary mapping each gene ID to its protein sequence
    
A directory will be created in ROOT_DIR named after your GEM_NAME. The folders are:
```
    .
    ├── data  # where dataframes are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
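
As an illustration of the layout above, the sketch below builds the expected folder paths from ROOT_DIR and GEM_NAME. The helper `expected_project_dirs` is hypothetical and not part of ssbio; it only mirrors the tree shown here and does not create any folders.

```python
import os.path as op

def expected_project_dirs(root_dir, gem_name):
    """Return the folder paths from the tree above (hypothetical helper, not part of ssbio)."""
    base = op.join(root_dir, gem_name)
    return {
        'base': base,
        'data': op.join(base, 'data'),
        'model': op.join(base, 'model'),
        'sequences': op.join(base, 'sequences'),
        'by_gene': op.join(base, 'structures', 'by_gene'),
        'by_complex': op.join(base, 'structures', 'by_complex'),
    }

dirs = expected_project_dirs('/home/nathan/projects_unsynced/',
                             'gempro_genes_and_sequences_py2_test')

# Per-gene sequence file, following the layout shown in the tree
fasta_path = op.join(dirs['sequences'], 'b4390', 'sequence.fasta')
print(dirs['base'])
print(fasta_path)
```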

In [4]:
GEM_NAME = 'gempro_genes_and_sequences_py2_test'

ROOT_DIR = '/home/nathan/projects_unsynced/'

GENES_AND_SEQUENCES = {'b4485': 'MTTDQHQEILRTEGLSKFFPGVKALDNVDFSLRRGEIMALLGENGAGKSTLIKALTGVYHADRGTIWLEGQAISPKNTAHAQQLGIGTVYQEVNLLPNMSVADNLFIGREPKRFGLLRRKEMEKRATELMASYGFSLDVREPLNRFSVAMQQIVAICRAIDLSAKVLILDEPTASLDTQEVELLFDLMRQLRDRGVSLIFVTHFLDQVYQVSDRITVLRNGSFVGCRETCELPQIELVKMMLGRELDTHALQRAGRTLLSDKPVAAFKNYGKKGTIAPFDLEVRPGEIVGLAGLLGSGRTETAEVIFGIKPADSGTALIKGKPQNLRSPHQASVLGIGFCPEDRKTDGIIAAASVRENIILALQAQRGWLRPISRKEQQEIAERFIRQLGIRTPSTEQPIEFLSGGNQQKVLLSRWLLTRPQFLILDEPTRGIDVGAHAEIIRLIETLCADGLALLVISSELEELVGYADRVIIMRDRKQVAEIPLAELSVPAIMNAIAA',
                       'b4390': 'MSSFDYLKTAIKQQGCTLQQVADASGMTKGYLSQLLNAKIKSPSAQKLEALHRFLGLEFPRQKKTIGVVFGKFYPLHTGHIYLIQRACSQVDELHIIMGFDDTRDRALFEDSAMSQQPTVPDRLRWLLQTFKYQKNIRIHAFNEEGMEPYPHGWDVWSNGIKKFMAEKGIQPDLIYTSEEADAPQYMEHLGIETVLVDPKRTFMSISGAQIRENPFRYWEYIPTEVKPFFVRTVAILGGESSGKSTLVNKLANIFNTTSAWEYGRDYVFSHLGGDEIALQYSDYDKIALGHAQYIDFAVKYANKVAFIDTDFVTTQAFCKKYEGREHPFVQALIDEYRFDLVILLENNTPWVADGLRSLGSSVDRKEFQNLLVEMLEENNIEFVRVEEEDYDSRFLRCVELVREMMGEQR'}
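
Before creating the project, it can help to check that each sequence is non-empty and contains only the 20 standard amino acid letters. The sketch below is a standalone check and an assumption about what a reasonable input looks like; `check_sequences` is a hypothetical helper, not an ssbio function, and the `gene_ok`/`gene_bad` IDs are made up for the example.

```python
# The 20 standard amino acid one-letter codes
STANDARD_AA = set('ACDEFGHIKLMNPQRSTVWY')

def check_sequences(genes_and_sequences):
    """Return {gene_id: sorted unexpected characters} for sequences that fail the check."""
    problems = {}
    for gene_id, seq in genes_and_sequences.items():
        unexpected = set(seq.upper()) - STANDARD_AA
        # Flag empty sequences as well as sequences with unexpected characters
        if not seq or unexpected:
            problems[gene_id] = sorted(unexpected)
    return problems

example = {
    'gene_ok': 'MTTDQHQEILRTEGLSKFFPGVKALDNVDFSL',  # first residues of b4485 above
    'gene_bad': 'MSSF DYLKTAIKQ*',                   # stray space and stop symbol
}
problems = check_sequences(example)
print(problems)
```

Running `check_sequences(GENES_AND_SEQUENCES)` on the dictionary above should return an empty dictionary if every sequence passes.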

In [5]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES)

INFO:ssbio.pipeline.gempro:/home/nathan/projects_unsynced/gempro_genes_and_sequences_py2_test: GEM-PRO project files location
INFO:ssbio.pipeline.gempro:Loaded in 2 sequences
INFO:ssbio.pipeline.gempro:2: number of genes


## Mapping sequence -> structure

There are two ways to map sequence to structure:
1. Simply use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB

### map_uniprot_to_pdb
This uses a service from the PDB to return a rank-ordered list of PDB entries that match a UniProt ID.
- seq_ident_cutoff (a fraction from 0 to 1)
    - Only structures whose sequence identity is above this cutoff are kept. For example, 0.9 keeps structures with more than 90% identity.
    - **Warning:** if you set the seq_ident_cutoff too high, you risk filtering out PDBs that do match the sequence but are missing large portions of it.

### blast_seqs_to_pdb
This BLASTs the representative sequence against the entire PDB and returns significant hits. Note that hits can come from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Same as above
- evalue
    - E-value cutoff for BLAST hits (smaller values are more stringent)
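
To make the two thresholds concrete, here is a minimal sketch of how a list of hits could be filtered by `seq_ident_cutoff` and `evalue`. This is not ssbio's implementation: the function `filter_hits`, its default values, and the exact comparison operators are assumptions. The field names follow the columns of `df_pdb_blast`, the first hit uses the values reported in the BLAST output later in this notebook, and the second hit is made up.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.0001):
    """Keep hits with identity >= cutoff (fraction, 0 to 1) and E-value <= threshold.

    Hypothetical helper; ssbio may apply its cutoffs differently.
    """
    return [h for h in hits
            if h['hit_percent_ident'] >= seq_ident_cutoff
            and h['hit_evalue'] <= evalue]

hits = [
    # Values from the df_pdb_blast output in this notebook
    {'pdb_id': '1lw7', 'hit_percent_ident': 0.465854, 'hit_evalue': 3.66442e-111},
    # Made-up weak hit, for illustration only
    {'pdb_id': 'xxxx', 'hit_percent_ident': 0.15, 'hit_evalue': 2e-3},
]

kept = filter_hits(hits, seq_ident_cutoff=0.2, evalue=0.00001)
print([h['pdb_id'] for h in kept])

# A stricter identity cutoff would also drop the 1lw7 hit
kept_strict = filter_hits(hits, seq_ident_cutoff=0.5, evalue=0.00001)
print([h['pdb_id'] for h in kept_strict])
```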

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b4390').protein.structures

### DataFrames
    - my_gempro.df_pdb_ranking
    - my_gempro.df_pdb_blast

In [6]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.2, evalue=0.00001)

INFO:ssbio.pipeline.gempro:b4390: adding 1 PDBs from BLAST results
INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.





In [7]:
# Looking at the BLAST results
my_gempro.df_pdb_blast.head()

| | gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b4390 | 1lw7 | A | 1026.0 | 3.66442e-111 | 0.634146 | 0.465854 | 191 | 260 |
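
The identity and similarity values in this row are fractions, not percentages. They are consistent with dividing the residue counts by 410, which appears to be the length of the b4390 query sequence. That denominator is inferred from the numbers in the table and is an assumption, not something read from ssbio. A quick arithmetic check:

```python
query_length = 410     # assumption: inferred from the table values
hit_num_ident = 191    # from the df_pdb_blast row
hit_num_similar = 260  # from the df_pdb_blast row

# float() keeps the division correct under Python 2 as well as Python 3
ident_fraction = hit_num_ident / float(query_length)
similar_fraction = hit_num_similar / float(query_length)

print('%.6f %.6f' % (ident_fraction, similar_fraction))
```

If identity is indeed computed against the full query length, a structure that covers only part of the protein scores low even when the covered region matches exactly. This is the situation the seq_ident_cutoff warning describes.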


## Ranking and downloading structures

### set_representative_structure
Rank available structures, run QC/QA, then download and clean the final structure.

### pdb_downloader_and_metadata
Download all structures per gene. This also adds some additional metadata to the annotation for each PDB.
    
-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b4390').protein.representative_structure

### DataFrames
    - my_gempro.df_representative_structures
    - my_gempro.df_pdb_metadata

In [8]:
my_gempro.set_representative_structure()

INFO:ssbio.pipeline.gempro:Created representative structures dataframe. See the "df_representative_structures" attribute.





In [9]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()

INFO:ssbio.pipeline.gempro:Created PDB metadata dataframe.





In [10]:
# Look at the summary of structures
my_gempro.df_pdb_metadata.head()

| | gene | pdb_id | pdb_title | description | experimental_method | resolution | chemicals | date | taxonomy_name | structure_file |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b4390 | 1lw7 | NADR PROTEIN FROM HAEMOPHILUS INFLUENZAE | TRANSCRIPTIONAL REGULATOR NADR | X-RAY DIFFRACTION | 2.9 | SO4;NAD | 2002-08-07;2004-02-10;2009-02-24;2011-07-13 | Haemophilus influenzae | |
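
The `chemicals` and `date` columns hold several values joined by semicolons. The sketch below splits them into Python lists using the values from the row above. The `row` dictionary is a stand-in for one DataFrame row, and what each individual date means is not stated in this notebook.

```python
from datetime import datetime

# Stand-in for one row of df_pdb_metadata, using the values shown above
row = {
    'pdb_id': '1lw7',
    'chemicals': 'SO4;NAD',
    'date': '2002-08-07;2004-02-10;2009-02-24;2011-07-13',
}

# Split the semicolon-joined fields into lists
chemicals = row['chemicals'].split(';')
dates = [datetime.strptime(d, '%Y-%m-%d').date() for d in row['date'].split(';')]

earliest, latest = min(dates), max(dates)
print(chemicals)
print(earliest.isoformat())
print(latest.isoformat())
```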
