# README

This notebook shows how to run the GEM-PRO pipeline on a **list of gene IDs**. It was run in Python 2, but the same code works in Python 3.

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, update the repository first (`git pull`).

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [2]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
logger.setLevel(logging.INFO)
#######################################################
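As an alternative to raising or lowering the root logger as a whole, the level can be set per logger. The following is a minimal sketch using only the standard library; the logger name `ssbio.pipeline.gempro` is taken from the `INFO:ssbio.pipeline.gempro:...` prefix in the output of this notebook, and the chosen levels are only an example.

```python
import logging

# Root logger: only warnings and above from everything else
# (this would hide the "INFO:root:getUserAgent" lines).
logging.getLogger().setLevel(logging.WARNING)

# Pipeline logger: keep the INFO progress messages from GEM-PRO.
# The name is assumed from the log prefix shown in this notebook.
pipeline_logger = logging.getLogger('ssbio.pipeline.gempro')
pipeline_logger.setLevel(logging.INFO)
```

Records from the pipeline logger still reach the root logger's handlers, because the level check happens on the logger that emits the record.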

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES
    - Your list of gene IDs
    
A directory named after your GEM_NAME will be created in ROOT_DIR, with the following folders:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
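The layout above means that the per-gene folders can be derived from ROOT_DIR, GEM_NAME and a gene ID. A small sketch follows; `project_paths` is a hypothetical helper written for this README, not part of ssbio, and it assumes a POSIX-style path as used in this notebook.

```python
import os.path as op

def project_paths(root_dir, gem_name, gene_id):
    """Return the expected folders for one gene (hypothetical helper, not ssbio API)."""
    base = op.join(root_dir, gem_name)
    return {
        'base': base,
        'sequences': op.join(base, 'sequences', gene_id),
        'structures': op.join(base, 'structures', 'by_gene', gene_id),
    }

paths = project_paths('/home/nathan/projects_unsynced/', 'gempro_py2_test', 'b1187')
print(paths['sequences'])
print(paths['structures'])
```

The `sequences` path built this way matches the folder in the `sequence_path` value printed for b1187 later in this notebook.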

In [4]:
GEM_NAME = 'gempro_py2_test'

ROOT_DIR = '/home/nathan/projects_unsynced/'

GENES = ['b0761','b0889','b0995','b1013','b1014','b1040','b1130','b1187','b1221','b1299','b1323']

In [5]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, genes_list=GENES)

INFO:ssbio.pipeline.gempro:Number of genes: 11


## Mapping gene ID -> sequence

**You only need to map gene IDs using one service.** However, you can run both if some genes fail to map in one service but map in the other.

### kegg_mapping_and_metadata
- kegg_organism_code
    - See the full list of organisms here: http://www.genome.jp/kegg/catalog/org_list.html
    - E. coli MG1655 is "eco"

### uniprot_mapping_and_metadata
You can try mapping your genes using the actual service here:
http://www.uniprot.org/uploadlists/

- model_gene_source
    - Here is a list of the gene ID types that can be mapped to UniProt IDs: http://www.uniprot.org/help/programmatic_access#id_mapping_examples
    - E. coli b-numbers in this example are of the source "ENSEMBLGENOME_ID"

### set_representative_sequence
This function consolidates the ID mapping sources (currently KEGG and UniProt) into one representative sequence per gene.
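To make the idea of consolidation concrete, here is a rough sketch of one possible rule: use the UniProt record when there is one, otherwise fall back to KEGG. This rule is an assumption made for illustration (it is consistent with the representative sequences shown further down, which all point to UniProt files), and `pick_representative` is a hypothetical function, not the ssbio implementation.

```python
def pick_representative(uniprot_record, kegg_record):
    """Prefer the UniProt record, fall back to KEGG (assumed rule, illustration only)."""
    if uniprot_record is not None:
        return dict(uniprot_record, source='uniprot')
    if kegg_record is not None:
        return dict(kegg_record, source='kegg')
    return None

# Values for b1299 copied from the dataframes shown further down in this notebook
uniprot_hit = {'gene': 'b1299', 'sequence_file': 'P0A9U6.fasta', 'sequence_len': 185}
kegg_hit = {'gene': 'b1299', 'sequence_file': 'eco-b1299.faa', 'sequence_len': 185}

print(pick_representative(uniprot_hit, kegg_hit)['sequence_file'])  # P0A9U6.fasta
print(pick_representative(None, kegg_hit)['sequence_file'])         # eco-b1299.faa
```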

-------------------

## Saved information:

### Gene annotations
Check out the gene annotations saved directly into the gene. These are COBRApy Gene objects!
    - my_gempro.genes.get_by_id('b1187').annotation

### DataFrames
Check out the dataframes to see a summary of results.
    - my_gempro.df_uniprot_metadata
    - my_gempro.df_kegg_metadata
    - my_gempro.df_representative_sequences  # Summary of the representative sequences

In [6]:
# UniProt mapping of gene_id -> uniprot_id
my_gempro.uniprot_mapping_and_metadata(model_gene_source='ENSEMBLGENOME_ID')
my_gempro.df_uniprot_metadata.head(2)

INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.12; Linux) Python-requests/2.9.1
INFO:root:getUserAgent: End
INFO:ssbio.pipeline.gempro:Created UniProt metadata dataframe. See the "df_uniprot_metadata" attribute.





| | gene | uniprot | reviewed | gene_name | kegg | refseq | pdbs | ec_number | pfam | sequence_len | description | entry_version | seq_version | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b1323 | P07604 | True | tyrR | ecj:JW1316;eco:b1323 | NP_415839.1;NC_000913.3;WP_001300658.1;NZ_LN83... | 2JHE | | PF13188;PF00158 | 513 | Transcriptional regulatory protein TyrR | 2016-11-02 | 1991-11-01 | P07604.fasta | P07604.txt |
| 1 | b0995 | P38684 | True | torR | ecj:JW0980;eco:b0995 | NP_415515.1;NC_000913.3;WP_001120125.1;NZ_LN83... | 1ZGZ | | PF00072;PF00486 | 230 | TorCAD operon transcriptional regulatory prote... | 2016-11-02 | 1997-11-01 | P38684.fasta | P38684.txt |


In [7]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='eco')
my_gempro.df_kegg_metadata.head()

INFO:ssbio.pipeline.gempro:11 genes mapped to KEGG
INFO:ssbio.pipeline.gempro:Created KEGG metadata dataframe. See the "df_kegg_metadata" attribute.





| | gene | kegg | refseq | uniprot | pdbs | sequence_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|
| 0 | b1323 | eco:b1323 | NP_415839 | P07604 | 2JHE | 513 | eco-b1323.faa | eco-b1323.kegg |
| 1 | b0995 | eco:b0995 | NP_415515 | P38684 | 1ZGZ | 230 | eco-b0995.faa | eco-b0995.kegg |
| 2 | b0761 | eco:b0761 | NP_415282 | P0A9G8 | 1B9M;1H9S;1B9N;1O7L;1H9R | 262 | eco-b0761.faa | eco-b0761.kegg |
| 3 | b1221 | eco:b1221 | NP_415739 | P0AF28 | 1JE8;1A04;1ZG1;1ZG5;1RNL | 216 | eco-b1221.faa | eco-b1221.kegg |
| 4 | b1299 | eco:b1299 | NP_415815 | P0A9U6 | | 185 | eco-b1299.faa | eco-b1299.kegg |


In [8]:
# Consolidate mapping sources
my_gempro.set_representative_sequence()
my_gempro.df_representative_sequences.head()

INFO:ssbio.pipeline.gempro:Created sequence mapping dataframe. See the "df_representative_sequences" attribute.


| | gene | uniprot | kegg | pdbs | sequence_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|
| 0 | b1323 | P07604 | ecj:JW1316;eco:b1323 | 2JHE | 513 | P07604.fasta | P07604.txt |
| 1 | b0995 | P38684 | ecj:JW0980;eco:b0995 | 1ZGZ | 230 | P38684.fasta | P38684.txt |
| 2 | b0761 | P0A9G8 | ecj:JW0744;eco:b0761 | 1O7L;1H9S;1B9N;1H9R;1B9M | 262 | P0A9G8.fasta | P0A9G8.txt |
| 3 | b1221 | P0AF28 | ecj:JW1212;eco:b1221 | 1ZG1;1A04;1ZG5;1JE8;1RNL | 216 | P0AF28.fasta | P0AF28.txt |
| 4 | b1299 | P0A9U6 | ecj:JW1292;eco:b1299 | | 185 | P0A9U6.fasta | P0A9U6.txt |


In [9]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b1187').protein.sequences
my_gempro.genes.get_by_id('b1187').protein.sequences[0].get_dict()

[<UniProtProp P0A8V6 at 0x7fbf794d8f50>,
 <KEGGProp eco:b1187 at 0x7fbf7949c8d0>]

{'bigg': None,
 'description': ['Fatty acid metabolism regulator protein {ECO:0000255|HAMAP-Rule:MF_00696}'],
 'ec_number': None,
 'entry_version': '2016-11-02',
 'gene_name': 'fadR',
 'id': u'P0A8V6',
 'kegg': ['ecj:JW1176', 'eco:b1187'],
 'metadata_file': 'P0A8V6.txt',
 'metadata_path': '/home/nathan/projects_unsynced/gempro_py2_test/sequences/b1187/P0A8V6.txt',
 'pdbs': ['1H9T', '1HW2', '1H9G', '1E2X', '1HW1'],
 'pfam': ['PF07840', 'PF00392'],
 'refseq': ['NP_415705.1', 'NC_000913.3', 'WP_000234823.1', 'NZ_LN832404.1'],
 'reviewed': True,
 'seq_record': None,
 'seq_str': None,
 'seq_version': '2007-01-23',
 'sequence_alignments': [],
 'sequence_file': 'P0A8V6.fasta',
 'sequence_len': 239,
 'sequence_path': '/home/nathan/projects_unsynced/gempro_py2_test/sequences/b1187/P0A8V6.fasta',
 'structure_alignments': [],
 'uniprot': u'P0A8V6'}

## Mapping sequence -> structure

There are two ways to map sequence to structure:
1. Simply use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB

### map_uniprot_to_pdb
This uses the PDBe best_structures service to return a rank-ordered list of PDBs that match a UniProt ID.
- seq_ident_cutoff (from 0 to 1)
    - Provide the seq_ident_cutoff as a fraction between 0 and 1 (e.g. 0.3 for 30%) to keep only structures with a sequence identity above the cutoff.
    - **Warning:** if you set the seq_ident_cutoff too high, you risk filtering out PDBs that do match the sequence but are missing large portions of it.
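The warning can be made concrete with numbers from the `df_pdb_ranking` output in this notebook: 2jhe chain A maps residues 1 to 190 of P07604, which is 513 residues long. The sketch below assumes that the cutoff is compared against a full-length measure such as the `coverage` column; that is an assumption, not something this README confirms, and both helpers are hypothetical.

```python
def coverage(unp_start, unp_end, sequence_len):
    """Fraction of the full sequence spanned by the mapped residues (hypothetical helper)."""
    return (unp_end - unp_start + 1) / float(sequence_len)

def passes_cutoff(value, seq_ident_cutoff):
    """Assumed filter rule: keep a structure whose value is at or above the cutoff."""
    return value >= seq_ident_cutoff

# 2jhe chain A maps residues 1-190 of P07604 (513 residues)
cov_2jhe = coverage(1, 190, 513)
print(round(cov_2jhe, 2))            # 0.37, as in the coverage column
print(passes_cutoff(cov_2jhe, 0.3))  # True: kept with the cutoff used in this notebook
print(passes_cutoff(cov_2jhe, 0.9))  # False: a correct but partial structure is dropped
```

Under that assumption, a structure that matches the protein perfectly but only covers about a third of it disappears as soon as the cutoff rises above 0.37.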

### blast_seqs_to_pdb
This BLASTs the representative sequence against the entire PDB and returns significant hits. Note that hits may come from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Same as above
- evalue
    - E-value threshold for significant BLAST hits (lower is stricter)
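How the two thresholds combine can be sketched with plain Python. The first hit below is the 4x1e row from the `df_pdb_blast` output in this notebook; the second hit is made up purely to show a row being filtered out. `filter_hits` is a hypothetical helper, and whether ssbio treats the boundaries as inclusive is an assumption.

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=0.001):
    """Keep hits at or above the identity cutoff and at or below the E-value threshold."""
    return [h for h in hits
            if h['hit_percent_ident'] >= seq_ident_cutoff
            and h['hit_evalue'] <= evalue]

hits = [
    # Row taken from the df_pdb_blast output in this notebook
    {'pdb_id': '4x1e', 'pdb_chain_id': 'A',
     'hit_percent_ident': 0.910377, 'hit_evalue': 1.2658e-104},
    # Made-up hit, included only to show a row being removed
    {'pdb_id': 'xxxx', 'pdb_chain_id': 'A',
     'hit_percent_ident': 0.45, 'hit_evalue': 2e-10},
]

kept = filter_hits(hits, seq_ident_cutoff=0.9, evalue=0.00001)
print([h['pdb_id'] for h in kept])  # ['4x1e']
```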

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b1187').annotation['structure']['pdb']

### DataFrames
    - my_gempro.df_pdb_ranking
    - my_gempro.df_pdb_blast

In [10]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()

INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.pyc; Python 2.7.12; Linux) Python-requests/2.9.1
INFO:root:getUserAgent: End
INFO:ssbio.pipeline.gempro:Completed UniProt -> best PDB mapping. See the "df_pdb_ranking" attribute.
INFO:ssbio.pipeline.gempro:9: number of genes with at least one structure
INFO:ssbio.pipeline.gempro:2: number of genes with no structures





| | gene | uniprot | pdb_id | pdb_chain_id | experimental_method | resolution | coverage | taxonomy_name | start | end | unp_start | unp_end | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b1323 | P07604 | 2jhe | A | X-ray diffraction | 2.3 | 0.37 | | 1 | 190 | 1 | 190 | 1 |
| 1 | b1323 | P07604 | 2jhe | B | X-ray diffraction | 2.3 | 0.37 | | 1 | 190 | 1 | 190 | 2 |
| 2 | b1323 | P07604 | 2jhe | C | X-ray diffraction | 2.3 | 0.37 | | 1 | 190 | 1 | 190 | 3 |
| 3 | b1323 | P07604 | 2jhe | D | X-ray diffraction | 2.3 | 0.37 | | 1 | 190 | 1 | 190 | 4 |
| 4 | b0995 | P38684 | 1zgz | A | X-ray diffraction | 1.8 | 0.53 | | 1 | 122 | 1 | 122 | 1 |


In [11]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head()

INFO:ssbio.pipeline.gempro:b1013: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.





| | gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | b1013 | 4x1e | A | 966.0 | 1.2658e-104 | 0.910377 | 0.910377 | 193 | 193 |
| 1 | b1013 | 4x1e | B | 966.0 | 1.2658e-104 | 0.910377 | 0.910377 | 193 | 193 |


In [12]:
# Looking at information saved per gene
my_gempro.genes.get_by_id('b1187').protein.structures
my_gempro.genes.get_by_id('b1187').protein.structures[0].get_dict()

[<PDBProp 1hw1 at 0x7fbf793f37d0>,
 <PDBProp 1e2x at 0x7fbf793f3910>,
 <PDBProp 1h9g at 0x7fbf793f39d0>,
 <PDBProp 1h9t at 0x7fbf793f3a50>,
 <PDBProp 1hw2 at 0x7fbf793f3ad0>]

{'chains': [<ChainProp A at 0x7fbf791d0d90>, <ChainProp B at 0x7fbf791d0c90>],
 'date': None,
 'description': None,
 'experimental_method': u'X-ray diffraction',
 'file_type': None,
 'id': '1hw1',
 'is_experimental': True,
 'mapped_chains': ['A', 'B'],
 'reference_seq': <SeqProp P0A8V6 at 0x7fbf791d0dd0>,
 'reference_seq_top_coverage': 0,
 'representative_chain': None,
 'resolution': 1.5,
 'structure': None,
 'structure_file': None,
 'structure_path': None,
 'taxonomy_name': None}

## Ranking and downloading structures

### set_representative_structure
Rank the available structures, run QC/QA, then download and clean the final structure.

### pdb_downloader_and_metadata
Download all structures per gene. This also adds some additional metadata to the annotation for each PDB.
    
-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('b1187').annotation['structure']['representative']

### DataFrames
    - my_gempro.df_representative_structures
    - my_gempro.df_pdb_metadata

In [None]:
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

INFO:ssbio.pipeline.gempro:Created representative structures dataframe. See the "df_representative_structures" attribute.





| | gene | id | is_experimental | reference_seq | reference_seq_top_coverage | structure_file |
|---|---|---|---|---|---|---|
| 0 | b0761 | 1b9m-A | True | P0A9G8 | 96.2 | 1b9m-A_clean.pdb |
| 1 | b1221 | 1a04-A | True | P0AF28 | 94.9 | 1a04-A_clean.pdb |
| 2 | b1187 | 1hw1-A | True | P0A8V6 | 94.6 | 1hw1-A_clean.pdb |
| 3 | b1013 | 4jyk-A | True | P0ACU2 | 94.8 | 4jyk-A_clean.pdb |
| 4 | b0889 | 2gqq-A | True | P0ACJ0 | 93.3 | 2gqq-A_clean.pdb |


In [None]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head()