# README

This notebook gives an example of how to run the GEM-PRO pipeline on a **list of gene IDs** in Python 2.

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md

In [1]:
# Import the GEM-PRO class
from ssbio.gempro.pipeline import GEMPRO

## Logs

The ssbio package displays logging messages for various parts of the pipeline. Set the logging level to one of the following to be notified of messages of varying importance:
- CRITICAL
    - Only the most severe messages are shown; in most of these cases an exception is also raised
- ERROR
    - Major errors
- WARNING
    - Warnings that don't affect running of the pipeline
- INFO
    - Info such as the number of structures mapped per gene
- DEBUG
    - Very detailed information; expect a large volume of output
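
As a quick illustration of how these levels filter messages, here is a minimal sketch that uses only the standard `logging` module. The logger name `demo` and the `ListHandler` class are made up for this example and are not part of ssbio.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical handler that collects formatted messages in a list."""

    def __init__(self):
        logging.Handler.__init__(self)
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


# Hypothetical logger name, used only for this illustration
demo_logger = logging.getLogger("demo")
demo_logger.propagate = False  # keep the demo separate from the root logger

demo_handler = ListHandler()
demo_handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
demo_logger.handlers = [demo_handler]

# Only WARNING and more severe messages pass this threshold
demo_logger.setLevel(logging.WARNING)

demo_logger.debug("very detailed output")
demo_logger.info("structures mapped per gene")
demo_logger.warning("something looks off")
demo_logger.error("a major error")

shown = demo_handler.messages
print(shown)
```

With the level set to WARNING, the DEBUG and INFO messages are dropped and only the last two are kept.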

In [2]:
# Getting logs to work in jupyter
# https://github.com/ipython/ipykernel/issues/111

# Create logger
import sys
import logging
logger = logging.getLogger()
# Create STDERR handler
handler = logging.StreamHandler(sys.stderr)
# Create formatter and add it to the handler
formatter = logging.Formatter('%(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
# Set STDERR handler as the only handler 
logger.handlers = [handler]
# Hide most requests messages
logging.getLogger("requests").setLevel(logging.WARNING)
logging.getLogger("urllib3").setLevel(logging.WARNING)
# Show multiple outputs of an IPython cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

########### SET YOUR LOGGING LEVEL HERE ###########
logger.setLevel(logging.INFO)
handler.setLevel(logging.INFO)

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES
    - Your list of gene IDs
    
A directory named after your GEM_NAME will be created in ROOT_DIR, with the following folders:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
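
The GEMPRO class creates these folders for you. Purely to make the layout concrete, the sketch below builds the same tree with the standard library in a temporary directory. The helper `make_project_dirs` is hypothetical and is not how ssbio implements it.

```python
import os
import tempfile

# Sub-folders taken from the tree above
SUBDIRS = [
    "data",
    "figures",
    "model",
    "notebooks",
    "sequences",
    os.path.join("structures", "by_gene"),
    os.path.join("structures", "by_complex"),
]


def make_project_dirs(root_dir, gem_name):
    """Create <root_dir>/<gem_name> with the sub-folders above (hypothetical helper)."""
    base = os.path.join(root_dir, gem_name)
    for sub in SUBDIRS:
        path = os.path.join(base, sub)
        if not os.path.isdir(path):
            os.makedirs(path)
    return base


# Use a throwaway directory so nothing real is touched
demo_root = tempfile.mkdtemp()
base_dir = make_project_dirs(demo_root, "py2test")
print(sorted(os.listdir(base_dir)))
```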

In [3]:
GEM_NAME = 'py2test'

ROOT_DIR = '/home/nathan/projects_unsynced/'

GENES = ['b0761','b0889','b0995','b1013','b1014','b1040','b1130','b1187','b1221','b1299','b1323']

In [4]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, genes_list=GENES)

ssbio.gempro.pipeline - INFO - Number of genes: 11


## Mapping gene ID -> sequence

**You only need to map gene IDs using one service.** However, you can try both if some genes fail to map in one service but map in the other!

### kegg_mapping_and_metadata
- kegg_organism_code
    - See the full list of organisms here: http://www.genome.jp/kegg/catalog/org_list.html
    - E. coli MG1655 is "eco"

### uniprot_mapping_and_metadata
You can try mapping your genes using the actual service here:
http://www.uniprot.org/uploadlists/

- model_gene_source
    - Here is a list of the gene ID types that can be mapped to UniProt IDs: http://www.uniprot.org/help/programmatic_access#id_mapping_examples
    - E. coli b-numbers in this example are of the source "ENSEMBLGENOME_ID"

### set_representative_sequence
This function allows you to consolidate sources of ID mapping (currently KEGG and UniProt).
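
To give a feel for what consolidating two mapping sources can look like, here is a minimal sketch. The function `pick_representative` and its preference order (UniProt first, KEGG as a fallback) are assumptions made for illustration; the actual rule used by ssbio may differ. The record values are taken from the b0995 output shown later in this notebook.

```python
def pick_representative(kegg_entry, uniprot_entries):
    """Return one sequence record from the two mapping sources (hypothetical rule)."""
    # Assumed rule: use a UniProt record when one exists...
    if uniprot_entries:
        first_acc = sorted(uniprot_entries)[0]
        return uniprot_entries[first_acc]
    # ...otherwise fall back to the KEGG record
    return kegg_entry


kegg_entry = {"kegg_id": "eco:b0995", "uniprot_acc": "P38684",
              "seq_file": "eco-b0995.faa", "seq_len": 230}
uniprot_entries = {"P38684": {"uniprot_acc": "P38684",
                              "seq_file": "P38684.fasta", "seq_len": 230}}

# Both sources available: the UniProt record is chosen
rep = pick_representative(kegg_entry, uniprot_entries)
print(rep["seq_file"])

# No UniProt mapping: the KEGG record is used instead
kegg_only = pick_representative(kegg_entry, {})
print(kegg_only["seq_file"])
```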

-------------------


## Saved information:

### Gene annotations
Check out the gene annotations saved directly into the gene. These are COBRApy Gene objects!
    - my_gempro.genes.get_by_id('b0995').annotation

### DataFrames
Check out the dataframes to see a summary of results.
    - my_gempro.df_uniprot_metadata
    - my_gempro.df_kegg_metadata
    - my_gempro.df_sequence_mapping

In [5]:
# Looking at the Gene object - empty for now
example_gene = my_gempro.genes[0]
example_gene.annotation

{'sequence': {'kegg': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'seq_file': None,
   'seq_len': 0,
   'uniprot_acc': None},
  'representative': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'seq_file': None,
   'seq_len': 0,
   'uniprot_acc': None},
  'uniprot': {}},
 'structure': {'homology': {},
  'pdb': OrderedDict(),
  'representative': {'clean_pdb_file': None,
   'original_pdb_file': None,
   'seq_coverage': 0,
   'structure_id': None}}}

In [6]:
# UniProt mapping of gene_id -> uniprot_id
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')

root - INFO - getUserAgent: Begin
root - INFO - getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.12; Linux) Python-requests/2.9.1
root - INFO - getUserAgent: End
100%|██████████| 11/11 [00:00<00:00, 2526.41it/s]
ssbio.gempro.pipeline - INFO - Created UniProt metadata dataframe. See the "df_uniprot_metadata" attribute.


In [7]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')

100%|██████████| 11/11 [00:00<00:00, 1323.54it/s]
ssbio.gempro.pipeline - INFO - Created KEGG metadata dataframe. See the "df_kegg_metadata" attribute.


In [8]:
# Consolidate mappings sources
my_gempro.set_representative_sequence()

ssbio.gempro.pipeline - INFO - Created sequence mapping dataframe. Inspect the "df_sequence_mapping" attribute for more info.


In [9]:
# Looking at the KEGG dataframe
my_gempro.df_kegg_metadata.head()

| | gene | uniprot_acc | kegg_id | seq_len | pdbs | seq_file | metadata_file |
|---|---|---|---|---|---|---|---|
| 0 | b1323 | P07604 | eco:b1323 | 513 | [2JHE] | eco-b1323.faa | eco-b1323.kegg |
| 1 | b0995 | P38684 | eco:b0995 | 230 | [1ZGZ] | eco-b0995.faa | eco-b0995.kegg |
| 2 | b0761 | P0A9G8 | eco:b0761 | 262 | [1B9M, 1H9S, 1B9N, 1O7L, 1H9R] | eco-b0761.faa | eco-b0761.kegg |
| 3 | b1221 | P0AF28 | eco:b1221 | 216 | [1JE8, 1A04, 1ZG1, 1ZG5, 1RNL] | eco-b1221.faa | eco-b1221.kegg |
| 4 | b1299 | P0A9U6 | eco:b1299 | 185 | | eco-b1299.faa | eco-b1299.kegg |


In [10]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b0995').annotation['sequence']

{'kegg': {'kegg_id': 'eco:b0995',
  'metadata_file': 'eco-b0995.kegg',
  'pdbs': [u'1ZGZ'],
  'seq_file': 'eco-b0995.faa',
  'seq_len': 230,
  'uniprot_acc': u'P38684'},
 'representative': {'kegg_id': ['ecj:JW0980', 'eco:b0995'],
  'metadata_file': 'P38684.txt',
  'pdbs': ['1ZGZ'],
  'seq_file': 'P38684.fasta',
  'seq_len': 230,
  'uniprot_acc': u'P38684'},
 'uniprot': {u'P38684': {'description': ['TorCAD operon transcriptional regulatory protein TorR'],
   'entry_version': '2016-10-05',
   'gene': 'b0995',
   'gene_name': 'torR',
   'kegg_id': ['ecj:JW0980', 'eco:b0995'],
   'metadata_file': 'P38684.txt',
   'pdbs': ['1ZGZ'],
   'pfam': ['PF00072', 'PF00486'],
   'refseq': ['NP_415515.1', 'NC_000913.3', 'WP_001120125.1', 'NZ_LN832404.1'],
   'reviewed': True,
   'seq_file': 'P38684.fasta',
   'seq_len': 230,
   'seq_version': '1997-11-01',
   'uniprot_acc': u'P38684'}}}

## Mapping sequence -> structure

There are two ways to map sequence to structure:
1. Simply use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence to the PDB

### map_uniprot_to_pdb
This uses the PDBe best_structures service to return a rank-ordered list of PDBs that match a UniProt ID.
- seq_ident_cutoff (from 0 to 1)
    - Provide the seq_ident_cutoff as a fraction between 0 and 1 (for example, 0.3 for 30%) to keep only structures whose sequence identity is above the cutoff.
    - **Warning:** if you set the seq_ident_cutoff too high you risk filtering out PDBs that do match the sequence, but are just missing large portions of it.
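
The warning can be made concrete with a small worked example. The sketch below assumes that the value compared against the cutoff behaves like the `seq_coverage` column of the PDB ranking dataframe, and that this column is the mapped UniProt range divided by the full sequence length. Both are assumptions: they are consistent with the numbers in this notebook (b1323 has a sequence length of 513 and 2jhe maps to residues 1-190), but they are not confirmed against the ssbio source. The helper names are hypothetical.

```python
def seq_coverage(unp_start, unp_end, seq_len):
    """Fraction of the full sequence spanned by the mapped UniProt range (assumed definition)."""
    return (unp_end - unp_start + 1) / float(seq_len)


def passes_cutoff(coverage, seq_ident_cutoff):
    """Assumed filter: keep a structure whose value is at or above the cutoff."""
    return coverage >= seq_ident_cutoff


# 2jhe chain A for b1323: residues 1-190 of a 513-residue sequence
cov_2jhe = seq_coverage(1, 190, 513)
print(round(cov_2jhe, 2))

# Kept with a permissive cutoff, dropped with a stricter one
print(passes_cutoff(cov_2jhe, 0.3))
print(passes_cutoff(cov_2jhe, 0.5))
```

Under these assumptions, a structure that matches the sequence but covers only about a third of it survives a cutoff of 0.3 and is filtered out at 0.5.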

### blast_seqs_to_pdb
This will BLAST the representative sequence against the entire PDB and return significant hits. Note that this can return hits from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Same as above
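
The effect of the all_genes flag can be sketched as a simple selection step. The helper `genes_to_blast` and the toy input below are hypothetical and only illustrate the behavior described above; they are not ssbio functions or real mapping results.

```python
def genes_to_blast(pdbs_per_gene, all_genes):
    """Select gene IDs to BLAST (hypothetical helper, not an ssbio function)."""
    if all_genes:
        # BLAST every gene, whether or not it already has PDBs
        return sorted(pdbs_per_gene)
    # Otherwise, BLAST only genes that have no PDBs yet
    return sorted(g for g, pdbs in pdbs_per_gene.items() if not pdbs)


# Toy data: gene ID -> PDBs already found for it
pdbs_per_gene = {"b0995": ["1ZGZ"], "b1323": ["2JHE"], "b1299": []}

selected_all = genes_to_blast(pdbs_per_gene, all_genes=True)
selected_missing = genes_to_blast(pdbs_per_gene, all_genes=False)
print(selected_all)
print(selected_missing)
```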

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b0995').annotation['structure']['pdb']

### DataFrames
    - my_gempro.df_pdb_ranking
    - my_gempro.df_pdb_blast

In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)

100%|██████████| 11/11 [00:03<00:00,  3.28it/s]
ssbio.gempro.pipeline - INFO - Completed UniProt -> best PDB mapping. See the "df_pdb_ranking" attribute.


In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.3)

  0%|          | 0/11 [00:00<?, ?it/s]
ssbio.gempro.pipeline - INFO - b0995: Adding 28 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b1221: Adding 1 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b1187: Adding 11 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b1130: Adding 32 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b1014: Adding 11 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b1013: Adding 2 PDBs from BLAST results.
ssbio.gempro.pipeline - INFO - b0889: Adding 25 PDBs from BLAST results.
100%|██████████| 11/11 [00:00<00:00, 180.51it/s]
ssbio.gempro.pipeline - INFO - Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.


In [13]:
# Looking at the PDB ranking dataframe
my_gempro.df_pdb_ranking.head()

| | gene | uniprot_acc | pdb_id | pdb_chain_id | experimental_method | resolution | seq_coverage | release_date | taxonomy_id | pdb_start | pdb_end | unp_start | unp_end | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b1323 | P07604 | 2jhe | A | X-ray diffraction | 2.3 | 0.37 | 2008-06-24 | 83333 | 1 | 190 | 1 | 190 | 1 |
| 1 | b1323 | P07604 | 2jhe | B | X-ray diffraction | 2.3 | 0.37 | 2008-06-24 | 83333 | 1 | 190 | 1 | 190 | 2 |
| 2 | b1323 | P07604 | 2jhe | C | X-ray diffraction | 2.3 | 0.37 | 2008-06-24 | 83333 | 1 | 190 | 1 | 190 | 3 |
| 3 | b1323 | P07604 | 2jhe | D | X-ray diffraction | 2.3 | 0.37 | 2008-06-24 | 83333 | 1 | 190 | 1 | 190 | 4 |
| 4 | b0995 | P38684 | 1zgz | A | X-ray diffraction | 1.8 | 0.53 | 2005-12-13 | 562 | 1 | 122 | 1 | 122 | 1 |


In [14]:
my_gempro.genes.get_by_id('b1221').annotation['structure']['representative']

{'clean_pdb_file': None,
 'original_pdb_file': None,
 'seq_coverage': 0,
 'structure_id': None}

In [15]:
# Looking at the information saved per gene
my_gempro.genes.get_by_id('b1221').annotation['structure']['pdb']

OrderedDict([(u'1a04_A',
              {'experimental_method': u'X-ray diffraction',
               'pdb_chain_id': u'A',
               'pdb_end': 215,
               'pdb_id': u'1a04',
               'pdb_start': 1,
               'rank': 1,
               'release_date': '1998-03-18',
               'resolution': 2.2,
               'seq_coverage': 0.995,
               'taxonomy_id': 562,
               'uniprot_acc': u'P0AF28',
               'unp_end': 216,
               'unp_start': 2}),
             (u'1a04_B',
              {'experimental_method': u'X-ray diffraction',
               'pdb_chain_id': u'B',
               'pdb_end': 215,
               'pdb_id': u'1a04',
               'pdb_start': 1,
               'rank': 2,
               'release_date': '1998-03-18',
               'resolution': 2.2,
               'seq_coverage': 0.995,
               'taxonomy_id': 562,
               'uniprot_acc': u'P0AF28',
               'unp_end': 216,
               'unp_start': 2})

## Ranking and downloading structures

### set_representative_structure
- In progress -- will rank and set one structure as the best one.
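
Since this step is still in progress, the following is only a sketch of one possible selection rule: pick the entry with the lowest `rank` value from a gene's PDB annotation. The helper `best_ranked` is hypothetical, and the rule is an assumption rather than the behavior of set_representative_structure. The entries mirror the b1221 annotation shown earlier, with most fields left out.

```python
from collections import OrderedDict


def best_ranked(pdb_annotation):
    """Return the key of the entry with the lowest 'rank' value, or None if empty."""
    if not pdb_annotation:
        return None
    return min(pdb_annotation, key=lambda k: pdb_annotation[k]["rank"])


# Deliberately out of order to show that the rank field decides, not the position
pdbs = OrderedDict([
    ("1a04_B", {"rank": 2, "resolution": 2.2, "seq_coverage": 0.995}),
    ("1a04_A", {"rank": 1, "resolution": 2.2, "seq_coverage": 0.995}),
])

top_hit = best_ranked(pdbs)
no_hit = best_ranked(OrderedDict())
print(top_hit)
print(no_hit)
```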

### pdb_downloader_and_metadata
Download structures per gene
- all_pdbs
    - Set to True if you want to download all PDBs that were mapped to your gene (respecting the seq_ident_cutoff that may have been set)
    - Set to False to download only the representative_structure
    
    
**NOTE:** Parsing the file for metadata takes a while if it is a big structure. I'll be adding a function to save the header information in the future.
    
-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b0995').annotation['structure']['representative']

### DataFrames
    - my_gempro.df_pdb_metadata

In [16]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata(all_pdbs=True)

  0%|          | 0/11 [00:00<?, ?it/s]


KeyError: u'2jhe'

In [16]:
# Look at the summary of structures
my_gempro.df_pdb_metadata.head()

| | gene | organism | experiment | resolution | chemicals | pdb_file | mmcif_file |
|---|---|---|---|---|---|---|---|
| 0 | b1323 | ESCHERICHIA COLI | X-RAY DIFFRACTION | 2.3 | [AE3, PG4, SO4] | 2jhe.pdb | 2jhe.cif |
| 1 | b0995 | ESCHERICHIA COLI STR. K-12 SUBSTR. MG1655 | X-RAY DIFFRACTION | 3.3 | [TBR] | 4b09.pdb | 4b09.cif |
| 2 | b0995 | Klebsiella pneumoniae | X-RAY DIFFRACTION | 3.8 | [DA, DT, DC, DG, BEF, MG] | 4s05.pdb | 4s05.cif |
| 3 | b0995 | Klebsiella pneumoniae | X-RAY DIFFRACTION | 3.2 | [DA, DT, DC, DG, BEF, MG] | 4s04.pdb | 4s04.cif |
| 4 | b0995 | Escherichia coli | X-RAY DIFFRACTION | 1.8 | [SO4, GOL] | 1zgz.pdb | 1zgz.cif |


In [17]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b0995').annotation['structure']['pdb']

{'1ys6': {'chemicals': ['CA', 'GOL'],
  'experiment': 'X-RAY DIFFRACTION',
  'gene': 'b0995',
  'mmcif_file': '1ys6.cif',
  'organism': 'Mycobacterium tuberculosis',
  'pdb_file': '1ys6.pdb',
  'resolution': '1.77'},
 '1ys7': {'chemicals': ['MG', 'ACT', 'TRS', 'GOL'],
  'experiment': 'X-RAY DIFFRACTION',
  'gene': 'b0995',
  'mmcif_file': '1ys7.cif',
  'organism': 'Mycobacterium tuberculosis',
  'pdb_file': '1ys7.pdb',
  'resolution': '1.58'},
 u'1zgz': {'chemicals': ['SO4', 'GOL'],
  'experiment': 'X-RAY DIFFRACTION',
  'gene': 'b0995',
  'mmcif_file': '1zgz.cif',
  'organism': 'Escherichia coli',
  'pdb_file': '1zgz.pdb',
  'resolution': '1.80'},
 '2gwr': {'chemicals': ['CA', 'GOL'],
  'experiment': 'X-RAY DIFFRACTION',
  'gene': 'b0995',
  'mmcif_file': '2gwr.cif',
  'organism': 'Mycobacterium tuberculosis',
  'pdb_file': '2gwr.pdb',
  'resolution': '2.100'},
 '2oqr': {'chemicals': ['ACT', 'LA', 'BME'],
  'experiment': 'X-RAY DIFFRACTION',
  'gene': 'b0995',
  'mmcif_file': '2oqr.ci