# README

This notebook gives an example of how to run the GEM-PRO pipeline on a **list of gene IDs** in Python 2.

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, update the repository first (`git pull`) before troubleshooting further.

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [2]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
# - CRITICAL
#     - Only really important messages shown
# - ERROR
#     - Major errors
# - WARNING
#     - Warnings that don't affect running of the pipeline
# - INFO
#     - Info such as the number of structures mapped per gene
# - DEBUG
#     - Really detailed information that will print out a lot of stuff
logger.setLevel(logging.INFO)
#######################################################
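
To see what the level setting does in practice, here is a minimal sketch that uses only the standard `logging` module. The logger name `gempro_demo` is a placeholder chosen for this illustration and is not a name used by ssbio.

```python
import logging

# Hypothetical logger name, used only for this illustration
demo_logger = logging.getLogger('gempro_demo')
demo_logger.setLevel(logging.INFO)

# With the level at INFO, INFO messages are handled and DEBUG messages are dropped
info_enabled = demo_logger.isEnabledFor(logging.INFO)
debug_enabled = demo_logger.isEnabledFor(logging.DEBUG)

# Raising the level to WARNING silences INFO messages as well
demo_logger.setLevel(logging.WARNING)
info_enabled_after = demo_logger.isEnabledFor(logging.INFO)
```

Choosing `logging.DEBUG` instead would enable every level listed above.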

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES
    - Your list of gene IDs
    
A directory named after your GEM_NAME will be created in ROOT_DIR, with the following folders:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
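
The layout above can be reproduced by hand, for example to check paths before running the pipeline. The sketch below is not ssbio's own code: it builds the same top-level folders in a temporary directory, and `demo_project` is a placeholder project name.

```python
import os
import tempfile

# Placeholder values for illustration; substitute your own GEM_NAME and ROOT_DIR
gem_name = 'demo_project'
root_dir = tempfile.mkdtemp()

# Sub-folders listed in the tree above
subfolders = [
    'data', 'figures', 'model', 'notebooks', 'sequences',
    os.path.join('structures', 'by_gene'),
    os.path.join('structures', 'by_complex'),
]

base_dir = os.path.join(root_dir, gem_name)
for sub in subfolders:
    path = os.path.join(base_dir, sub)
    if not os.path.isdir(path):
        os.makedirs(path)  # also creates missing parent folders

# Top-level folders that now exist under ROOT_DIR/GEM_NAME
created = sorted(os.listdir(base_dir))
```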

In [3]:
GEM_NAME = 'py2test'

ROOT_DIR = '/home/nathan/projects_unsynced/'

GENES = ['b0761','b0889','b0995','b1013','b1014','b1040','b1130','b1187','b1221','b1299','b1323']

In [4]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, genes_list=GENES)

INFO:ssbio.pipeline.gempro:Number of genes: 11


## Mapping gene ID -> sequence

**You only need to map gene IDs using one service.** However, you can run both: genes that fail to map in one service may still map in the other.

### kegg_mapping_and_metadata
- kegg_organism_code
    - See the full list of organisms here: http://www.genome.jp/kegg/catalog/org_list.html
    - E. coli MG1655 is "eco"

### uniprot_mapping_and_metadata
You can try mapping your genes using the actual service here:
http://www.uniprot.org/uploadlists/

- model_gene_source
    - Here is a list of the gene ID types that can be mapped to UniProt IDs: http://www.uniprot.org/help/programmatic_access#id_mapping_examples
    - E. coli b-numbers in this example are of the source "ENSEMBLGENOME_ID"

### set_representative_sequence
This function allows you to consolidate sources of ID mapping (currently KEGG and UniProt).
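
As a rough illustration of what consolidation could look like, the sketch below prefers a UniProt entry when one is available and falls back to the KEGG entry otherwise. This rule is an assumption: it is consistent with the `b1187` output shown later in this notebook, but the actual selection logic inside ssbio is not documented here. The helper `pick_representative` is hypothetical.

```python
def pick_representative(kegg_entry, uniprot_entries):
    """Return a UniProt entry if one has a sequence file, else the KEGG entry.

    Assumption for illustration only: UniProt is preferred over KEGG.
    """
    for acc in sorted(uniprot_entries):
        entry = uniprot_entries[acc]
        if entry.get('seq_file'):
            return entry
    return kegg_entry


# Values copied from the b1187 annotation in this notebook
kegg = {'uniprot_acc': 'P0A8V6', 'seq_file': 'eco-b1187.faa', 'seq_len': 239}
uniprot = {'P0A8V6': {'uniprot_acc': 'P0A8V6', 'seq_file': 'P0A8V6.fasta', 'seq_len': 239}}

rep = pick_representative(kegg, uniprot)   # UniProt entry is chosen
fallback = pick_representative(kegg, {})   # no UniProt mapping: KEGG entry is used
```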

-------------------

## Saved information:

### Gene annotations
Check out the gene annotations saved directly into the gene. These are COBRApy Gene objects!
    - my_gempro.genes.get_by_id('b1187').annotation

### DataFrames
Check out the dataframes to see a summary of results.
    - my_gempro.df_uniprot_metadata
    - my_gempro.df_kegg_metadata
    - my_gempro.df_sequence_mapping  # Summary of the representative sequences

In [5]:
# Looking at the Gene object - empty for now
example_gene = my_gempro.genes[0]
example_gene.annotation

{'sequence': {'kegg': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'seq_file': None,
   'seq_len': 0,
   'uniprot_acc': None},
  'representative': {'kegg_id': None,
   'metadata_file': None,
   'pdbs': [],
   'seq_file': None,
   'seq_len': 0,
   'uniprot_acc': None},
  'uniprot': {}},
 'structure': {'homology': {},
  'pdb': OrderedDict(),
  'representative': {'clean_pdb_file': None,
   'original_pdb_file': None,
   'seq_coverage': 0,
   'structure_id': None}}}

In [6]:
# UniProt mapping of gene_id -> uniprot_id
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')

INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.12; Linux) Python-requests/2.9.1
INFO:root:getUserAgent: End
100%|██████████| 11/11 [00:00<00:00, 2010.96it/s]
INFO:ssbio.pipeline.gempro:Created UniProt metadata dataframe. See the "df_uniprot_metadata" attribute.


In [7]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')

100%|██████████| 11/11 [00:00<00:00, 618.29it/s]
INFO:ssbio.pipeline.gempro:Created KEGG metadata dataframe. See the "df_kegg_metadata" attribute.


In [8]:
# Consolidate mapping sources
my_gempro.set_representative_sequence()

INFO:ssbio.pipeline.gempro:Created sequence mapping dataframe. See the "df_sequence_mapping" attribute.


In [9]:
# Looking at the KEGG dataframe
my_gempro.df_kegg_metadata.head()

Unnamed: 0,gene,uniprot_acc,kegg_id,seq_len,pdbs,seq_file,metadata_file
0,b1323,P07604,eco:b1323,513,[2JHE],eco-b1323.faa,eco-b1323.kegg
1,b0995,P38684,eco:b0995,230,[1ZGZ],eco-b0995.faa,eco-b0995.kegg
2,b0761,P0A9G8,eco:b0761,262,"[1B9M, 1H9S, 1B9N, 1O7L, 1H9R]",eco-b0761.faa,eco-b0761.kegg
3,b1221,P0AF28,eco:b1221,216,"[1JE8, 1A04, 1ZG1, 1ZG5, 1RNL]",eco-b1221.faa,eco-b1221.kegg
4,b1299,P0A9U6,eco:b1299,185,,eco-b1299.faa,eco-b1299.kegg


In [10]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b1187').annotation['sequence']

{'kegg': {'kegg_id': 'eco:b1187',
  'metadata_file': 'eco-b1187.kegg',
  'pdbs': ['1HW1', '1H9T', '1HW2', '1H9G', '1E2X'],
  'seq_file': 'eco-b1187.faa',
  'seq_len': 239,
  'uniprot_acc': 'P0A8V6'},
 'representative': {'kegg_id': ['ecj:JW1176', 'eco:b1187'],
  'metadata_file': 'P0A8V6.txt',
  'pdbs': ['1H9T', '1HW2', '1H9G', '1E2X', '1HW1'],
  'seq_file': 'P0A8V6.fasta',
  'seq_len': 239,
  'uniprot_acc': 'P0A8V6'},
 'uniprot': {'P0A8V6': {'description': ['Fatty acid metabolism regulator protein {ECO:0000255|HAMAP-Rule:MF_00696}'],
   'entry_version': '2016-10-05',
   'gene': 'b1187',
   'gene_name': 'fadR',
   'kegg_id': ['ecj:JW1176', 'eco:b1187'],
   'metadata_file': 'P0A8V6.txt',
   'pdbs': ['1H9T', '1HW2', '1H9G', '1E2X', '1HW1'],
   'pfam': ['PF07840', 'PF00392'],
   'refseq': ['NP_415705.1', 'NC_000913.3', 'WP_000234823.1', 'NZ_LN832404.1'],
   'reviewed': True,
   'seq_file': 'P0A8V6.fasta',
   'seq_len': 239,
   'seq_version': '2007-01-23',
   'uniprot_acc': 'P0A8V6'}}}
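
Because the annotation is an ordinary nested dictionary, individual fields can be read with plain key lookups. The sketch below works on a trimmed copy of the output above; `get_nested` is a hypothetical convenience helper, not part of ssbio or COBRApy.

```python
# Trimmed copy of the b1187 'sequence' annotation shown above
annotation = {
    'kegg': {'seq_file': 'eco-b1187.faa', 'seq_len': 239, 'uniprot_acc': 'P0A8V6'},
    'representative': {'seq_file': 'P0A8V6.fasta', 'seq_len': 239, 'uniprot_acc': 'P0A8V6'},
    'uniprot': {'P0A8V6': {'gene_name': 'fadR', 'pfam': ['PF07840', 'PF00392']}},
}


def get_nested(d, keys, default=None):
    """Follow a list of keys into nested dicts, returning default if any is missing."""
    for key in keys:
        if not isinstance(d, dict) or key not in d:
            return default
        d = d[key]
    return d


rep_acc = get_nested(annotation, ['representative', 'uniprot_acc'])
gene_name = get_nested(annotation, ['uniprot', rep_acc, 'gene_name'])
# A UniProt accession that was never mapped simply yields the default
missing = get_nested(annotation, ['uniprot', 'Q00000', 'gene_name'], default='n/a')
```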

## Mapping sequence -> structure

There are two ways to map sequence to structure:
1. Simply use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB

### map_uniprot_to_pdb
This uses a service from the PDB to return a rank-ordered list of PDBs that match a UniProt ID.
- seq_ident_cutoff (from 0 to 1)
    - Provide the seq_ident_cutoff as a fraction (for example, 0.3 for 30%) to keep only structures whose sequence identity is above the cutoff.
    - **Warning:** if you set the seq_ident_cutoff too high, you risk filtering out PDBs that do match the sequence but are missing large portions of it.
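
The warning can be made concrete with a small calculation. The sketch assumes that identity is measured relative to the full-length gene sequence (this README does not state exactly how ssbio computes it), and uses the b1323 numbers from this notebook: a 513-residue sequence with a structure covering residues 1 to 190.

```python
def identity_over_full_sequence(n_identical, seq_len):
    """Fraction of the full-length sequence matched by identical residues."""
    return n_identical / float(seq_len)


seq_len = 513       # length of the b1323 sequence
n_identical = 190   # assumption: every residue in the covered region 1-190 matches

identity = identity_over_full_sequence(n_identical, seq_len)

# A perfectly matching but partial structure passes a low cutoff...
passes_low = identity >= 0.3
# ...and is filtered out by a high one
passes_high = identity >= 0.9
```

Under this assumption the structure scores about 0.37, so a cutoff of 0.3 keeps it while 0.9 discards it.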

### blast_seqs_to_pdb
This BLASTs the representative sequence against the entire PDB and returns significant hits. Note that hits may come from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Same as above
- evalue
    - E-value cutoff used to judge the significance of BLAST hits
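
The way these three parameters interact can be sketched in plain Python. Everything here is illustrative: the helper names, the gene and PDB identifiers, and the `identity` and `evalue` dictionary keys are hypothetical and do not reflect ssbio's internals.

```python
def select_genes_to_blast(pdbs_per_gene, all_genes):
    """Return all genes, or only those without any mapped PDBs."""
    if all_genes:
        return sorted(pdbs_per_gene)
    return sorted(g for g, pdbs in pdbs_per_gene.items() if not pdbs)


def keep_hit(hit, seq_ident_cutoff, evalue):
    """Keep a hit that is similar enough and statistically significant."""
    return hit['identity'] >= seq_ident_cutoff and hit['evalue'] <= evalue


# Placeholder data: geneA already has a PDB from the UniProt mapping, geneB has none
pdbs_per_gene = {'geneA': ['pdb1'], 'geneB': []}

blast_everything = select_genes_to_blast(pdbs_per_gene, all_genes=True)
blast_unmapped = select_genes_to_blast(pdbs_per_gene, all_genes=False)

good_hit = {'identity': 0.93, 'evalue': 1e-100}
weak_hit = {'identity': 0.95, 'evalue': 0.01}
kept = [keep_hit(h, seq_ident_cutoff=0.9, evalue=0.00001) for h in (good_hit, weak_hit)]
```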

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b1187').annotation['structure']['pdb']

### DataFrames
    - my_gempro.df_pdb_ranking
    - my_gempro.df_pdb_blast

In [11]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)

100%|██████████| 11/11 [00:03<00:00,  2.91it/s]
INFO:ssbio.pipeline.gempro:Completed UniProt -> best PDB mapping. See the "df_pdb_ranking" attribute.


In [12]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)

  0%|          | 0/11 [00:00<?, ?it/s]INFO:ssbio.pipeline.gempro:b1013: Adding 2 PDBs from BLAST results.
100%|██████████| 11/11 [00:00<00:00, 151.57it/s]
INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.


In [13]:
# Looking at the PDB ranking dataframe
my_gempro.df_pdb_ranking.head()

Unnamed: 0,gene,uniprot_acc,pdb_id,pdb_chain_id,experimental_method,resolution,seq_coverage,release_date,taxonomy_id,pdb_start,pdb_end,unp_start,unp_end,rank
0,b1323,P07604,2jhe,A,X-ray diffraction,2.3,0.37,2008-06-24,83333,1,190,1,190,1
1,b1323,P07604,2jhe,B,X-ray diffraction,2.3,0.37,2008-06-24,83333,1,190,1,190,2
2,b1323,P07604,2jhe,C,X-ray diffraction,2.3,0.37,2008-06-24,83333,1,190,1,190,3
3,b1323,P07604,2jhe,D,X-ray diffraction,2.3,0.37,2008-06-24,83333,1,190,1,190,4
4,b0995,P38684,1zgz,A,X-ray diffraction,1.8,0.53,2005-12-13,562,1,122,1,122,1


In [14]:
# Looking at the BLAST results
my_gempro.df_pdb_blast.head()

Unnamed: 0,gene,pdb_id,pdb_chain_id,resolution,release_date,blast_score,blast_evalue,seq_coverage,seq_similar,seq_num_coverage,seq_num_similar
0,b0761,1b9n,A,2.09,2000-03-15,1091.0,4.7979900000000004e-119,0.931298,0.931298,244,244
1,b0761,1b9n,B,2.09,2000-03-15,1091.0,4.7979900000000004e-119,0.931298,0.931298,244,244
2,b0761,1b9m,A,1.75,2000-03-15,1091.0,4.7979900000000004e-119,0.931298,0.931298,244,244
3,b0761,1b9m,B,1.75,2000-03-15,1091.0,4.7979900000000004e-119,0.931298,0.931298,244,244
4,b0761,1o7l,A,2.75,2003-02-20,1089.0,9.510390000000001e-119,0.931298,0.931298,244,244


In [15]:
# Looking at the information saved per gene
my_gempro.genes.get_by_id('b1187').annotation['structure']['pdb']

OrderedDict([(('1hw1', 'A'),
              {'experimental_method': u'X-ray diffraction',
               'pdb_chain_id': 'A',
               'pdb_end': 239,
               'pdb_id': '1hw1',
               'pdb_start': 1,
               'rank': 1,
               'release_date': '2001-01-24',
               'resolution': 1.5,
               'seq_coverage': 1,
               'taxonomy_id': 562,
               'uniprot_acc': 'P0A8V6',
               'unp_end': 239,
               'unp_start': 1}),
             (('1hw1', 'B'),
              {'experimental_method': u'X-ray diffraction',
               'pdb_chain_id': 'B',
               'pdb_end': 239,
               'pdb_id': '1hw1',
               'pdb_start': 1,
               'rank': 2,
               'release_date': '2001-01-24',
               'resolution': 1.5,
               'seq_coverage': 1,
               'taxonomy_id': 562,
               'uniprot_acc': 'P0A8V6',
               'unp_end': 239,
               'unp_start': 1}),
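
The `pdb` annotation is an `OrderedDict` keyed by `(pdb_id, chain_id)` tuples, so the insertion order is preserved when iterating. The sketch below uses a trimmed copy of the output above to show how the keys can be unpacked; the variable names are illustrative.

```python
from collections import OrderedDict

# Trimmed copy of the b1187 'pdb' annotation shown above
pdbs = OrderedDict([
    (('1hw1', 'A'), {'rank': 1, 'resolution': 1.5, 'seq_coverage': 1}),
    (('1hw1', 'B'), {'rank': 2, 'resolution': 1.5, 'seq_coverage': 1}),
])

# The first key is the first chain that was stored
first_key = next(iter(pdbs))

# Tuple keys unpack directly in the loop; group chain IDs by PDB ID
chains_by_pdb = {}
for (pdb_id, chain_id), info in pdbs.items():
    chains_by_pdb.setdefault(pdb_id, []).append(chain_id)

# A single chain is looked up with the full tuple
rank_of_chain_b = pdbs[('1hw1', 'B')]['rank']
```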
    

## Ranking and downloading structures

### set_representative_structure
Rank available structures, run QC/QA, then download and clean the final structure.

### pdb_downloader_and_metadata
Download all structures per gene. This also adds some additional metadata to the annotation for each PDB.
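
To give a feel for what ranking structures can involve, the sketch below orders candidate chains by sequence coverage (higher first) and then by resolution (lower first). These criteria are an assumption made for illustration; the README does not specify which criteria or QC/QA checks `set_representative_structure` applies, and the candidate IDs are placeholders.

```python
def rank_structures(candidates):
    """Sort by descending seq_coverage, then ascending resolution."""
    return sorted(candidates, key=lambda c: (-c['seq_coverage'], c['resolution']))


# Placeholder candidates; real entries would come from the 'pdb' annotation
candidates = [
    {'pdb_id': 'pdb1', 'chain': 'A', 'resolution': 2.3, 'seq_coverage': 0.37},
    {'pdb_id': 'pdb2', 'chain': 'A', 'resolution': 1.8, 'seq_coverage': 1.0},
    {'pdb_id': 'pdb3', 'chain': 'A', 'resolution': 1.5, 'seq_coverage': 1.0},
]

ranked = rank_structures(candidates)
ranked_ids = [c['pdb_id'] for c in ranked]
top_choice = ranked[0]
```

With these inputs the full-coverage, best-resolution candidate `pdb3` comes first.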
    
-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b1187').annotation['structure']['representative']

### DataFrames
    - my_gempro.df_pdb_metadata

In [16]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()

100%|██████████| 11/11 [00:08<00:00,  1.15s/it]
INFO:ssbio.pipeline.gempro:Created PDB metadata dataframe.


In [17]:
# Look at the summary of structures
my_gempro.df_pdb_metadata.head()

Unnamed: 0,gene,pdb_id,pdb_chain_id,taxonomy_name,experimental_method,resolution,seq_coverage,chemicals,rank,release_date,pdb_file,mmcif_header
0,b1323,2jhe,A,ESCHERICHIA COLI,X-RAY DIFFRACTION,2.3,0.37,AE3;PG4;SO4,1.0,2008-06-24,2jhe.pdb,2jhe.header.cif
1,b1323,2jhe,B,ESCHERICHIA COLI,X-RAY DIFFRACTION,2.3,0.37,AE3;PG4;SO4,2.0,2008-06-24,2jhe.pdb,2jhe.header.cif
2,b1323,2jhe,C,ESCHERICHIA COLI,X-RAY DIFFRACTION,2.3,0.37,AE3;PG4;SO4,3.0,2008-06-24,2jhe.pdb,2jhe.header.cif
3,b1323,2jhe,D,ESCHERICHIA COLI,X-RAY DIFFRACTION,2.3,0.37,AE3;PG4;SO4,4.0,2008-06-24,2jhe.pdb,2jhe.header.cif
4,b0995,1zgz,A,Escherichia coli,X-RAY DIFFRACTION,1.8,0.53,SO4;GOL,1.0,2005-12-13,1zgz.pdb,1zgz.header.cif


In [18]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b1187').annotation['structure']['pdb']

OrderedDict([(('1hw1', 'A'),
              {'chemicals': ['SO4', 'ZN'],
               'experimental_method': 'X-RAY DIFFRACTION',
               'mmcif_header': '1hw1.header.cif',
               'pdb_chain_id': 'A',
               'pdb_end': 239,
               'pdb_file': '1hw1.pdb',
               'pdb_id': '1hw1',
               'pdb_start': 1,
               'rank': 1,
               'release_date': '2001-01-24',
               'resolution': 1.5,
               'seq_coverage': 1,
               'taxonomy_id': 562,
               'taxonomy_name': 'Escherichia coli',
               'uniprot_acc': 'P0A8V6',
               'unp_end': 239,
               'unp_start': 1}),
             (('1hw1', 'B'),
              {'chemicals': ['SO4', 'ZN'],
               'experimental_method': 'X-RAY DIFFRACTION',
               'mmcif_header': '1hw1.header.cif',
               'pdb_chain_id': 'B',
               'pdb_end': 239,
               'pdb_file': '1hw1.pdb',
               'pdb_id': '1hw1',

In [19]:
my_gempro.genes.get_by_id('b1187').annotation

{'sequence': {'kegg': {'kegg_id': 'eco:b1187',
   'metadata_file': 'eco-b1187.kegg',
   'pdbs': ['1HW1', '1H9T', '1HW2', '1H9G', '1E2X'],
   'seq_file': 'eco-b1187.faa',
   'seq_len': 239,
   'uniprot_acc': 'P0A8V6'},
  'representative': {'kegg_id': ['ecj:JW1176', 'eco:b1187'],
   'metadata_file': 'P0A8V6.txt',
   'pdbs': ['1H9T', '1HW2', '1H9G', '1E2X', '1HW1'],
   'seq_file': 'P0A8V6.fasta',
   'seq_len': 239,
   'uniprot_acc': 'P0A8V6'},
  'uniprot': {'P0A8V6': {'description': ['Fatty acid metabolism regulator protein {ECO:0000255|HAMAP-Rule:MF_00696}'],
    'entry_version': '2016-10-05',
    'gene': 'b1187',
    'gene_name': 'fadR',
    'kegg_id': ['ecj:JW1176', 'eco:b1187'],
    'metadata_file': 'P0A8V6.txt',
    'pdbs': ['1H9T', '1HW2', '1H9G', '1E2X', '1HW1'],
    'pfam': ['PF07840', 'PF00392'],
    'refseq': ['NP_415705.1',
     'NC_000913.3',
     'WP_000234823.1',
     'NZ_LN832404.1'],
    'reviewed': True,
    'seq_file': 'P0A8V6.fasta',
    'seq_len': 239,
    'seq_version