# README

This notebook gives an example of how to run the GEM-PRO pipeline on a **dictionary of gene IDs and their protein sequences** (shown in Python 2, but the code is the same for Python 3).

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, update the repository first (git pull) before trying anything else

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [2]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
# - CRITICAL
#     - Only really important messages shown
# - ERROR
#     - Major errors
# - WARNING
#     - Warnings that don't affect running of the pipeline
# - INFO
#     - Info such as the number of structures mapped per gene
# - DEBUG
#     - Really detailed information that will print out a lot of stuff
logger.setLevel(logging.INFO)
#######################################################
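As a quick check of how the level threshold behaves, the sketch below uses only the standard `logging` module. The logger name `gempro_logging_demo` is made up for this illustration and is not part of ssbio.

```python
import logging

# Hypothetical logger name, used only for this demonstration
demo_logger = logging.getLogger('gempro_logging_demo')
demo_logger.setLevel(logging.INFO)

# Messages at INFO and above pass the threshold; DEBUG messages do not
info_enabled = demo_logger.isEnabledFor(logging.INFO)
debug_enabled = demo_logger.isEnabledFor(logging.DEBUG)
print(info_enabled)
print(debug_enabled)
```

Switching the level to `logging.DEBUG` would make both checks pass, at the cost of much more output.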

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES_AND_SEQUENCES
    - A dictionary mapping each gene ID to its protein sequence
    
A directory named after your GEM_NAME will be created in ROOT_DIR. It contains these folders:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
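For orientation, the layout above could be reproduced by hand with a few lines of standard-library Python. This is only a sketch: `make_project_dirs` is a hypothetical helper, not an ssbio function, and the pipeline creates these folders itself when the project is initialized.

```python
import os
import tempfile

def make_project_dirs(root_dir, gem_name):
    """Create the folder layout listed above and return the project path.

    Hypothetical helper for illustration; not part of ssbio.
    """
    base = os.path.join(root_dir, gem_name)
    subdirs = [
        'data', 'figures', 'model', 'notebooks', 'sequences',
        os.path.join('structures', 'by_gene'),
        os.path.join('structures', 'by_complex'),
    ]
    for sub in subdirs:
        path = os.path.join(base, sub)
        if not os.path.isdir(path):  # explicit check works on Python 2 and 3
            os.makedirs(path)
    return base

# Demonstrate in a throwaway temporary directory
project_dir = make_project_dirs(tempfile.mkdtemp(), 'my_test_project')
print(sorted(os.listdir(project_dir)))
```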

In [3]:
GEM_NAME = 'gempro_genes_and_sequences_py2_test'

ROOT_DIR = '/home/nathan/projects_unsynced/'

GENES_AND_SEQUENCES = {'b4485': 'MTTDQHQEILRTEGLSKFFPGVKALDNVDFSLRRGEIMALLGENGAGKSTLIKALTGVYHADRGTIWLEGQAISPKNTAHAQQLGIGTVYQEVNLLPNMSVADNLFIGREPKRFGLLRRKEMEKRATELMASYGFSLDVREPLNRFSVAMQQIVAICRAIDLSAKVLILDEPTASLDTQEVELLFDLMRQLRDRGVSLIFVTHFLDQVYQVSDRITVLRNGSFVGCRETCELPQIELVKMMLGRELDTHALQRAGRTLLSDKPVAAFKNYGKKGTIAPFDLEVRPGEIVGLAGLLGSGRTETAEVIFGIKPADSGTALIKGKPQNLRSPHQASVLGIGFCPEDRKTDGIIAAASVRENIILALQAQRGWLRPISRKEQQEIAERFIRQLGIRTPSTEQPIEFLSGGNQQKVLLSRWLLTRPQFLILDEPTRGIDVGAHAEIIRLIETLCADGLALLVISSELEELVGYADRVIIMRDRKQVAEIPLAELSVPAIMNAIAA',
                       'b4390': 'MSSFDYLKTAIKQQGCTLQQVADASGMTKGYLSQLLNAKIKSPSAQKLEALHRFLGLEFPRQKKTIGVVFGKFYPLHTGHIYLIQRACSQVDELHIIMGFDDTRDRALFEDSAMSQQPTVPDRLRWLLQTFKYQKNIRIHAFNEEGMEPYPHGWDVWSNGIKKFMAEKGIQPDLIYTSEEADAPQYMEHLGIETVLVDPKRTFMSISGAQIRENPFRYWEYIPTEVKPFFVRTVAILGGESSGKSTLVNKLANIFNTTSAWEYGRDYVFSHLGGDEIALQYSDYDKIALGHAQYIDFAVKYANKVAFIDTDFVTTQAFCKKYEGREHPFVQALIDEYRFDLVILLENNTPWVADGLRSLGSSVDRKEFQNLLVEMLEENNIEFVRVEEEDYDSRFLRCVELVREMMGEQR'}

In [4]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, genes_and_sequences=GENES_AND_SEQUENCES)

INFO:ssbio.pipeline.gempro:Loaded in 2 sequences


In [5]:
# Looking at the Gene object - mostly empty for now, apart from the representative sequence
example_gene = my_gempro.genes[0]
example_gene.annotation

{'sequence': {'kegg': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'seq_file': None,
   'seq_len': 0,
   'uniprot_acc': None},
  'representative': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'properties': {},
   'seq_file': 'b4485.faa',
   'seq_len': 500,
   'uniprot_acc': None},
  'uniprot': {}},
 'structure': {'homology': {},
  'pdb': OrderedDict(),
  'representative': {'clean_pdb_file': None,
   'original_pdb_file': None,
   'seq_coverage': 0,
   'structure_id': None}}}

## Mapping sequence -> structure

There are two ways to map sequence to structure:
1. Use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence against the PDB

### map_uniprot_to_pdb
This uses a service from the PDB to return a rank-ordered list of PDBs that match a UniProt ID.
- seq_ident_cutoff (from 0 to 1)
    - Provide the seq_ident_cutoff as a fraction between 0 and 1 to keep only structures whose sequence identity is above the cutoff.
    - **Warning:** if you set the seq_ident_cutoff too high, you risk filtering out PDBs that do match the sequence but are missing large portions of it.

### blast_seqs_to_pdb
This will BLAST the representative sequence against the entire PDB and return significant hits. Note that the hits can include structures from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Same as above
- evalue
    - E-value cutoff that sets the significance threshold for BLAST hits
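
The effect of the two thresholds can be sketched in plain Python. Everything here is an assumption made for illustration: `filter_hits`, the `seq_ident` field, and the made-up hits are not ssbio's API. The sketch also assumes that identity is measured relative to the full query length and that hits exactly at a threshold are kept.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=1e-5):
    """Keep hits at or above the identity cutoff and at or below the E-value.

    Hypothetical helper; whether ssbio treats the thresholds as inclusive
    is an assumption.
    """
    return [h for h in hits
            if h['seq_ident'] >= seq_ident_cutoff and h['evalue'] <= evalue]

# Made-up hits for illustration only
example_hits = [
    # near-identical match over the whole query
    {'pdb_id': 'hit_full', 'seq_ident': 0.95, 'evalue': 1e-50},
    # perfect match, but the structure covers only 30% of the query
    {'pdb_id': 'hit_partial', 'seq_ident': 0.30, 'evalue': 1e-20},
    # weak match with an insignificant E-value
    {'pdb_id': 'hit_weak', 'seq_ident': 0.40, 'evalue': 1e-2},
]

strict = [h['pdb_id'] for h in filter_hits(example_hits, seq_ident_cutoff=0.5)]
lenient = [h['pdb_id'] for h in filter_hits(example_hits, seq_ident_cutoff=0.1)]
print(strict)
print(lenient)
```

With the stricter cutoff, the partial structure is dropped even though it matches the region it covers, which is the situation the warning above describes.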

-------------------

## Saved information:

### Gene annotations
- my_gempro.genes.get_by_id('b4485').annotation['structure']['pdb']

### DataFrames
- my_gempro.df_pdb_ranking
- my_gempro.df_pdb_blast

In [6]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.1, evalue=0.00001)

  0%|          | 0/2 [00:00<?, ?it/s]INFO:ssbio.pipeline.gempro:b4485: Adding 383 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:b4390: Adding 1 PDBs from BLAST results.
100%|██████████| 2/2 [00:00<00:00, 162.61it/s]
INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.


In [7]:
# Looking at the BLAST results
my_gempro.df_pdb_blast.head()

| | gene | pdb_id | pdb_chain_id | blast_score | blast_evalue | seq_coverage | seq_similar | seq_num_coverage | seq_num_similar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b4485 | 1z47 | A | 299.0 | 9.62942e-27 | 0.17 | 0.286 | 85 | 143 |
| 1 | b4485 | 1z47 | B | 299.0 | 9.62942e-27 | 0.17 | 0.286 | 85 | 143 |
| 2 | b4485 | 1g6h | A | 289.0 | 1.36747e-25 | 0.142 | 0.232 | 71 | 116 |
| 3 | b4485 | 4u02 | A | 287.0 | 2.128e-25 | 0.132 | 0.238 | 66 | 119 |
| 4 | b4485 | 4u02 | B | 287.0 | 2.128e-25 | 0.132 | 0.238 | 66 | 119 |
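
The results contain one row per chain, so the same PDB entry can appear several times. A small sketch of collapsing chain-level rows into one entry per PDB ID, using the five rows shown above copied by hand (in practice you would work on the `df_pdb_blast` DataFrame directly):

```python
from collections import OrderedDict

# (gene, pdb_id, pdb_chain_id, blast_evalue), copied from the rows shown above
rows = [
    ('b4485', '1z47', 'A', 9.62942e-27),
    ('b4485', '1z47', 'B', 9.62942e-27),
    ('b4485', '1g6h', 'A', 1.36747e-25),
    ('b4485', '4u02', 'A', 2.128e-25),
    ('b4485', '4u02', 'B', 2.128e-25),
]

# Group chain IDs under their PDB entry, preserving the original rank order
chains_by_pdb = OrderedDict()
for gene, pdb_id, chain, evalue in rows:
    chains_by_pdb.setdefault(pdb_id, []).append(chain)

for pdb_id, chains in chains_by_pdb.items():
    print(pdb_id + ': ' + ','.join(chains))
```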


## Ranking and downloading structures

### set_representative_structure
Ranks the available structures for each gene, runs QC/QA, then downloads and cleans the final structure.

### pdb_downloader_and_metadata
Downloads all structures per gene. This also adds some additional metadata to the annotation for each PDB.
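
The ranking criteria are not spelled out here, so the following is only a sketch of what ranking candidate structures could look like. It assumes a preference for higher sequence coverage, with ties broken by better (lower) resolution. `rank_structures` and the candidate entries are hypothetical and do not describe ssbio's actual method.

```python
# Hypothetical candidates; field names mirror annotation keys seen in this notebook
candidates = [
    {'structure_id': 'cand_a', 'seq_coverage': 0.60, 'resolution': 2.5},
    {'structure_id': 'cand_b', 'seq_coverage': 0.95, 'resolution': 3.1},
    {'structure_id': 'cand_c', 'seq_coverage': 0.95, 'resolution': 1.8},
]

def rank_structures(structures):
    """Sort by coverage (descending), then by resolution (ascending).

    Assumed criteria for illustration only.
    """
    return sorted(structures,
                  key=lambda s: (-s['seq_coverage'], s['resolution']))

ranked = rank_structures(candidates)
representative_id = ranked[0]['structure_id']
print([s['structure_id'] for s in ranked])
```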
    
-------------------

## Saved information:

### Gene annotations
- my_gempro.genes.get_by_id('b4485').annotation['structure']['representative']

### DataFrames
- my_gempro.df_pdb_metadata

In [8]:
my_gempro.set_representative_structure()

100%|██████████| 2/2 [05:33<00:00, 233.40s/it]


In [9]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()

  0%|          | 0/2 [00:00<?, ?it/s]


KeyError: 'taxonomy_name'
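
A `KeyError` like the one above is raised when a key is read from a dictionary that does not contain it. The sketch below shows the general pattern for tolerating a missing key with `dict.get`. The entries are illustrative (the second deliberately has no `taxonomy_name`), and this does not describe how ssbio handles the case internally.

```python
# Illustrative per-structure metadata; the second entry lacks 'taxonomy_name'
entries = [
    {'pdb_id': '1z47', 'taxonomy_name': 'Alicyclobacillus acidocaldarius'},
    {'pdb_id': '1g6h'},
]

# entry['taxonomy_name'] would raise KeyError for the second entry;
# .get() returns a fallback value instead
taxonomy = [(e['pdb_id'], e.get('taxonomy_name', 'unknown')) for e in entries]

for pdb_id, name in taxonomy:
    print(pdb_id + ': ' + name)
```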

In [None]:
# Look at the summary of structures
my_gempro.df_pdb_metadata.head()

In [11]:
for g in my_gempro.genes:
    print(g.id, g.annotation['structure'])

('b4485', {'homology': {}, 'pdb': OrderedDict([(('1z47', 'A'), {'seq_similar': 0.286, 'pdb_id': '1z47', 'seq_coverage': 0.17, 'pdb_file': '1z47.pdb', 'seq_num_similar': 143, 'taxonomy_name': 'Alicyclobacillus acidocaldarius', 'blast_evalue': 9.62942e-27, 'blast_score': 299.0, 'pdb_chain_id': 'A', 'seq_num_coverage': 85, 'chemicals': ['CL'], 'mmcif_header': '1z47.header.cif', 'resolution': 1.9, 'experimental_method': 'X-RAY DIFFRACTION'}), (('1z47', 'B'), {'seq_similar': 0.286, 'pdb_id': '1z47', 'seq_coverage': 0.17, 'pdb_file': '1z47.pdb', 'seq_num_similar': 143, 'taxonomy_name': 'Alicyclobacillus acidocaldarius', 'blast_evalue': 9.62942e-27, 'blast_score': 299.0, 'pdb_chain_id': 'B', 'seq_num_coverage': 85, 'chemicals': ['CL'], 'mmcif_header': '1z47.header.cif', 'resolution': 1.9, 'experimental_method': 'X-RAY DIFFRACTION'}), (('1g6h', 'A'), {'seq_similar': 0.232, 'pdb_id': '1g6h', 'seq_coverage': 0.142, 'pdb_file': '1g6h.pdb', 'seq_num_similar': 116, 'blast_evalue': 1.36747e-25, 'bla