# GEM-PRO - SBML Model (iNJ661)

This notebook shows how to run the GEM-PRO pipeline with an **SBML model**, in this case *i*NJ661, the metabolic model of *M. tuberculosis*.

<div class="alert alert-info">

**Input:** 
GEM (in SBML, JSON, or MAT formats)

</div>

<div class="alert alert-info">

**Output:**
GEM-PRO model

</div>

## Imports

In [1]:
import sys
import logging

In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

### Logging

Set the logging level in `logger.setLevel(logging.<LEVEL_HERE>)` to control how verbose the pipeline is. The levels below are listed from least to most verbose.

- `CRITICAL`
     - Only the most important messages are shown
- `ERROR`
     - Major errors
- `WARNING`
     - Warnings that don't affect running of the pipeline
- `INFO` (default)
     - Info such as the number of structures mapped per gene
- `DEBUG`
     - Very detailed information; prints a large amount of output
     
<div class="alert alert-warning">

**Warning:** 
`DEBUG` mode prints out a large amount of information, especially if you have a lot of genes. This may stall your notebook!
</div>

In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # SET YOUR LOGGING LEVEL HERE #

In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
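
To see how the level acts as a filter, here is a minimal sketch using only the standard `logging` module. The logger name `gempro_demo` and the in-memory stream are assumptions made for illustration; they are not part of `ssbio`.

```python
import io
import logging

# Capture messages in memory so the effect of each level is easy to inspect
stream = io.StringIO()
demo_handler = logging.StreamHandler(stream)
demo_handler.setFormatter(logging.Formatter('%(levelname)s: %(message)s'))

# A throwaway logger; the name "gempro_demo" is hypothetical
demo_logger = logging.getLogger('gempro_demo')
demo_logger.handlers = [demo_handler]
demo_logger.propagate = False  # keep demo messages out of the root logger

# At WARNING, messages below WARNING are dropped
demo_logger.setLevel(logging.WARNING)
demo_logger.debug('hidden at WARNING')
demo_logger.info('hidden at WARNING')
demo_logger.warning('shown at WARNING')

# At DEBUG, everything passes through
demo_logger.setLevel(logging.DEBUG)
demo_logger.debug('shown at DEBUG')

shown = stream.getvalue().splitlines()
print(shown)
```

Only the `WARNING` message from the first batch and the `DEBUG` message from the second reach the handler.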

## Initialization of the project

Set these four things:

- `ROOT_DIR`
    - The directory where a folder named after your `PROJECT` will be created
- `PROJECT`
    - Your project name
- `GEM_FILE`
    - The path to your GEM file
- `GEM_FILE_TYPE`
    - The format of your GEM file (`'json'` in this example)
    
A directory will be created in `ROOT_DIR` with your `PROJECT` name. The folders are organized like so:
```
    ROOT_DIR
    └── PROJECT
        ├── data  # General storage for pipeline outputs
        ├── model  # SBML and GEM-PRO models are stored here
        ├── genes  # Per gene information
        │   ├── <gene_id1>  # Specific gene directory
        │   │   └── protein
        │   │       ├── sequences  # Protein sequence files, alignments, etc.
        │   │       └── structures  # Protein structure files, calculations, etc.
        │   └── <gene_id2>
        │       └── protein
        │           ├── sequences
        │           └── structures
        ├── reactions  # Per reaction information
        │   └── <reaction_id1>  # Specific reaction directory
        │       └── complex
        │           └── structures  # Protein complex files
        └── metabolites  # Per metabolite information
            └── <metabolite_id1>  # Specific metabolite directory
                └── chemical
                    └── structures  # Metabolite 2D and 3D structure files
                
```
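
The gene portion of the layout above can be reproduced with a few lines of standard-library Python. This is only a sketch of the folder structure: `make_project_dirs` is a hypothetical helper, not the code `ssbio` uses, and it writes into a fresh temporary directory.

```python
import os
import tempfile

def make_project_dirs(root_dir, project, gene_ids):
    """Create a folder layout like the tree above (hypothetical helper, not ssbio code)."""
    base = os.path.join(root_dir, project)
    # Top-level folders
    for sub in ('data', 'model', 'genes', 'reactions', 'metabolites'):
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    # Per-gene sequence and structure folders
    for gene_id in gene_ids:
        for leaf in ('sequences', 'structures'):
            os.makedirs(os.path.join(base, 'genes', gene_id, 'protein', leaf), exist_ok=True)
    return base

demo_root = tempfile.mkdtemp()
base_dir = make_project_dirs(demo_root, 'iNJ661_GP', ['Rv1295', 'Rv2233'])
seq_dir = os.path.join(base_dir, 'genes', 'Rv1295', 'protein', 'sequences')
print(os.path.isdir(seq_dir))
```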

<div class="alert alert-info">**Note:** Methods for protein complexes and metabolites are still in development.</div>

In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()

PROJECT = 'iNJ661_GP'
GEM_FILE = '/home/nathan/Downloads/iNJ661.json'
GEM_FILE_TYPE = 'json'

In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE)

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: /tmp/iNJ661_GP: GEM-PRO project location
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 661: number of genes


## Mapping gene ID --> sequence

First, we need to map these gene IDs to their protein sequences. Two ID mapping services are provided for this: **KEGG** and **UniProt**. The end goal is to map each gene ID to a UniProt ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.

<p><div class="alert alert-info">**Note:** You only need to map gene IDs using one service. However, you can run both if some genes don't map in one service but do map in the other!</div></p>

However, you don't need to map using these services if you already have the amino acid sequences for each protein. You can just manually load in the sequences as shown using the method `manual_seq_mapping`. Or, if you already have the UniProt IDs, you can load those in using the method `manual_uniprot_mapping`.
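
If your sequences live in a FASTA file, a dictionary in the shape passed to `manual_seq_mapping` can be built with a small parser. The sketch below is an illustration only: `read_fasta` is a hypothetical helper, it assumes the gene ID is the first token of each header line, and the example sequences are truncated.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into {record_id: sequence} (hypothetical helper)."""
    sequences = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Assumption: the first whitespace-delimited token of the header is the gene ID
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else None
            if current_id is not None:
                sequences[current_id] = ''
        elif current_id is not None:
            # Sequence lines are concatenated until the next header
            sequences[current_id] += line
    return sequences

# Truncated sequences, for illustration only
fasta_text = """>Rv1295 example record
MTVPPTATHQ
PWPGVIAAYR
>Rv2233 example record
VSSPRERRPA
"""
gene_to_seq = read_fasta(fasta_text)
print(gene_to_seq)
```

The resulting `gene_to_seq` has gene IDs as keys and amino acid strings as values, the same shape as the dictionary used in the cell below.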

### Methods

In [8]:
# Manually load amino acid sequences for specific genes
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
                    'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences


In [9]:
# Manually map specific gene IDs to UniProt IDs
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.


| gene | uniprot | reviewed | gene_name | kegg | refseq | num_pdbs | ec_number | pfam | seq_len | description | entry_version | seq_version | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rv0619 | Q79FY3 | False | galTb | | | 0 | | PF02744 | 181 | Probable galactose-1-phosphate uridylyltransfe... | 2017-02-15 | 2004-07-05 | Q79FY3.fasta | Q79FY3.txt |
| Rv0618 | Q79FY4 | False | galTa | mtv:RVBD_0618 | | 0 | | PF01087 | 231 | Probable galactose-1-phosphate uridylyltransfe... | 2017-02-15 | 2004-07-05 | Q79FY4.fasta | Q79FY4.txt |
| Rv2321c | P71891 | False | rocD2 | mtv:RVBD_2321c | | 0 | | PF00202 | 181 | Probable ornithine aminotransferase (C-terminu... | 2017-02-15 | 1997-02-01 | P71891.fasta | P71891.txt |
| Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1;NZ_KK339370.1 | 0 | | PF00202 | 221 | Probable ornithine aminotransferase (N-terminu... | 2017-02-15 | 1997-02-01 | P71890.fasta | P71890.txt |


In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.



Missing KEGG mapping:  ['Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c', 'Rv2233', 'Rv0618']


| gene | kegg | refseq | uniprot | num_pdbs | pdbs | seq_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|
| Rv0417 | mtu:Rv0417 | NP_214931 | P9WG73 | 0 | | 252 | mtu-Rv0417.faa | mtu-Rv0417.kegg |
| Rv2291 | mtu:Rv2291 | NP_216807 | P9WHF5 | 0 | | 284 | mtu-Rv2291.faa | mtu-Rv2291.kegg |
| Rv3737 | mtu:Rv3737 | NP_218254 | O69704 | 0 | | 529 | mtu-Rv3737.faa | mtu-Rv3737.kegg |
| Rv1295 | mtu:Rv1295 | NP_215811 | P9WG59 | 1 | 2D1F | 360 | mtu-Rv1295.faa | mtu-Rv1295.kegg |
| Rv1559 | mtu:Rv1559 | NP_216075 | P9WG95 | 0 | | 429 | mtu-Rv1559.faa | mtu-Rv1559.kegg |


In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()

[2017-03-08 13:18] [root] INFO: getUserAgent: Begin
[2017-03-08 13:18] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-03-08 13:18] [root] INFO: getUserAgent: End
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 589/661: number of genes mapped to UniProt
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.



Missing UniProt mapping:  ['Rv1512', 'Rv3784', 'Rv1618', 'Rv0266c', 'Rv0375c', 'Rv2320c', 'Rv3317', 'Rv1915', 'Rv3469c', 'Rv0958', 'Rv2233', 'Rv0147', 'Rv0649', 'Rv2318', 'Rv2858c', 'Rv0253', 'Rv0155', 'Rv0753c', 'Rv2398c', 'Rv2471', 'Rv0143c', 'Rv3777', 'Rv2590', 'Rv0317c', 'Rv0993', 'Rv1704c', 'Rv3737', 'Rv3281', 'Rv3331', 'Rv3758c', 'Rv0860', 'Rv2436', 'Rv2382c', 'Rv1163', 'Rv3113', 'Rv3759c', 'Rv0974c', 'Rv3379c', 'Rv2062c', 'Rv1662', 'Rv2316', 'Rv1647', 'Rv1928c', 'Rv2380c', 'Rv2379c', 'Rv2833c', 'Rv1164', 'Rv0156', 'Rv0727c', 'Rv0812', 'Rv1005c', 'Rv1127c', 'Rv1511', 'Rv3332', 'Rv1844c', 'Rv3468c', 'Rv3565', 'Rv1902c', 'Rv1162', 'Rv2381c', 'Rv0082', 'Rv2458', 'Rv0252', 'Rv2671', 'Rv0511', 'Rv1916', 'Rv1239c', 'Rv2524c']


| gene | uniprot | reviewed | gene_name | kegg | refseq | num_pdbs | pdbs | ec_number | pfam | seq_len | description | entry_version | seq_version | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Rv0417 | P9WG73 | True | thiG | mtu:Rv0417 | NP_214931.1;NC_000962.3;WP_003916659.1;NZ_KK33... | 0 | | 2.8.1.10 | PF05690 | 252 | Thiazole synthase {ECO:0000255\|HAMAP-Rule:MF_0... | 2017-02-15 | 2014-04-16 | P9WG73.fasta | P9WG73.txt |
| Rv2291 | P9WHF5 | True | sseB | mtu:Rv2291 | NP_216807.1;NC_000962.3;WP_003899253.1;NZ_KK33... | 0 | | 2.8.1.1 | PF00581 | 284 | Putative thiosulfate sulfurtransferase SseB | 2017-02-15 | 2014-04-16 | P9WHF5.fasta | P9WHF5.txt |
| Rv1295 | P9WG59 | True | thrC | mtu:Rv1295 | NP_215811.1;NC_000962.3;WP_003406652.1;NZ_KK33... | 1 | 2D1F | 4.2.3.1 | PF00291 | 360 | TS;Threonine synthase | 2017-02-15 | 2014-04-16 | P9WG59.fasta | P9WG59.txt |
| Rv1559 | P9WG95 | True | ilvA | mtu:Rv1559 | NP_216075.1;NC_000962.3;WP_003407781.1;NZ_KK33... | 0 | | 4.3.1.19 | PF00291;PF00585 | 429 | Threonine deaminase;L-threonine dehydratase bi... | 2017-02-15 | 2014-04-16 | P9WG95.fasta | P9WG95.txt |
| Rv2447c | I6Y0R5 | True | folC | mtu:Rv2447c;mtv:RVBD_2447c | NP_216963.1;NC_000962.3;WP_003899324.1;NZ_KK33... | 2 | 2VOS;2VOR | 6.3.2.17;6.3.2.12 | PF02875;PF08245 | 487 | Tetrahydrofolylpolyglutamate synthase;Folylpol... | 2017-02-15 | 2012-10-03 | I6Y0R5.fasta | I6Y0R5.txt |
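
Since both services were run, it can be useful to check which genes failed to map in either one. The sketch below uses plain set operations on the two printed lists. The variable names are illustrative, and the UniProt list is truncated to its first eleven entries for brevity.

```python
# Genes reported as missing from the KEGG mapping (copied from the output above)
missing_kegg = ['Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c', 'Rv2233', 'Rv0618']

# First eleven genes reported as missing from the UniProt mapping (list truncated)
missing_uniprot_subset = ['Rv1512', 'Rv3784', 'Rv1618', 'Rv0266c', 'Rv0375c', 'Rv2320c',
                          'Rv3317', 'Rv1915', 'Rv3469c', 'Rv0958', 'Rv2233']

# Genes that neither service could map
missing_in_both = sorted(set(missing_kegg) & set(missing_uniprot_subset))
print(missing_in_both)
```

Here the only gene missing from both lists is `Rv2233`, which is one of the two genes whose sequence was loaded manually with `manual_seq_mapping`.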


In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.



Missing a representative sequence:  []


| gene | uniprot | kegg | num_pdbs | pdbs | seq_len | sequence_file |
|---|---|---|---|---|---|---|
| Rv0417 | P9WG73 | mtu:Rv0417 | 0 | | 252 | P9WG73.fasta |
| Rv2291 | P9WHF5 | mtu:Rv2291 | 0 | | 284 | P9WHF5.fasta |
| Rv3737 | O69704 | mtu:Rv3737 | 0 | | 529 | mtu-Rv3737.faa |
| Rv1295 | P9WG59 | mtu:Rv1295 | 1 | 2D1F | 360 | Rv1295.faa |
| Rv1559 | P9WG95 | mtu:Rv1559 | 0 | | 429 | P9WG95.fasta |


## Mapping representative sequence --> structure

There are four ways to map a sequence to a structure:

1. Use the UniProt ID and its automatic mappings to the PDB
2. BLAST the sequence against the PDB
3. Make homology models
4. Map to existing homology models

Option 1 is only available if the representative sequence has a mapped UniProt ID. If not, you'll have to BLAST your sequence against the PDB or make a homology model. You can also run both for maximum coverage.
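
The decision described above can be written down as a small function. This is a sketch for illustration: `structure_mapping_routes` is a hypothetical helper, not an `ssbio` function, and `Rv2233` is assumed here to have no UniProt ID because it appears in both missing-mapping lists.

```python
def structure_mapping_routes(uniprot_id):
    """List the mapping routes open to a gene (hypothetical helper, not part of ssbio)."""
    routes = []
    if uniprot_id:
        # The UniProt -> PDB mapping needs a UniProt ID on the representative sequence
        routes.append('UniProt ID -> PDB')
    # BLAST and homology modeling only need the sequence itself
    routes.append('BLAST sequence -> PDB')
    routes.append('homology model')
    return routes

# UniProt ID of each representative sequence (None if unmapped; Rv2233 is an assumption)
rep_seq_uniprot = {'Rv1295': 'P9WG59', 'Rv2233': None}
routes_by_gene = {gene: structure_mapping_routes(uid) for gene, uid in rep_seq_uniprot.items()}

for gene in sorted(routes_by_gene):
    print(gene, routes_by_gene[gene])
```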

### Methods

In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2017-03-08 13:18] [root] INFO: getUserAgent: Begin
[2017-03-08 13:18] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.12.4
[2017-03-08 13:18] [root] INFO: getUserAgent: End
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 176/661: number of genes with at least one experimental structure
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.





| gene | pdb_id | pdb_chain_id | uniprot | experimental_method | resolution | coverage | start | end | unp_start | unp_end | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rv1295 | 2d1f | A | P9WG59 | X-ray diffraction | 2.5 | 1.0 | 1 | 360 | 1 | 360 | 1 |
| Rv1295 | 2d1f | B | P9WG59 | X-ray diffraction | 2.5 | 1.0 | 1 | 360 | 1 | 360 | 2 |
| Rv1201c | 3fsy | A | P9WP21 | X-ray diffraction | 1.97 | 0.997 | 4 | 319 | 2 | 317 | 1 |
| Rv1201c | 3fsy | B | P9WP21 | X-ray diffraction | 1.97 | 0.997 | 4 | 319 | 2 | 317 | 2 |
| Rv1201c | 3fsy | C | P9WP21 | X-ray diffraction | 1.97 | 0.997 | 4 | 319 | 2 | 317 | 3 |


In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: 31: number of genes with additional structures added from BLAST





| gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|
| Rv3846 | 1gn2 | A | 1023.0 | 2.6572999999999997e-111 | 0.995169 | 0.995169 | 206 | 206 |
| Rv3846 | 1gn2 | B | 1023.0 | 2.6572999999999997e-111 | 0.995169 | 0.995169 | 206 | 206 |
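
The `seq_ident_cutoff` and `evalue` arguments suggest a filter of roughly the following shape. This is a sketch under assumptions: how `ssbio` applies the cutoffs internally (for example, whether the bounds are inclusive) is not shown in this notebook, `filter_blast_hits` is a hypothetical helper, and the third hit is invented to show a rejection.

```python
def filter_blast_hits(hits, seq_ident_cutoff=0.9, evalue=1e-5):
    """Keep hits with identity >= cutoff and e-value <= threshold (hypothetical helper)."""
    return [hit for hit in hits
            if hit['hit_percent_ident'] >= seq_ident_cutoff
            and hit['hit_evalue'] <= evalue]

hits = [
    # The two Rv3846 hits from the table above
    {'pdb_id': '1gn2', 'pdb_chain_id': 'A', 'hit_percent_ident': 0.995169, 'hit_evalue': 2.6573e-111},
    {'pdb_id': '1gn2', 'pdb_chain_id': 'B', 'hit_percent_ident': 0.995169, 'hit_evalue': 2.6573e-111},
    # Invented low-identity hit, included only to show what gets filtered out
    {'pdb_id': 'xxxx', 'pdb_chain_id': 'A', 'hit_percent_ident': 0.42, 'hit_evalue': 1e-3},
]

kept = filter_blast_hits(hits)
print([(hit['pdb_id'], hit['pdb_chain_id']) for hit in kept])
```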


In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'

##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###

# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.





[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.





| gene | c_score | difficulty | id | model_date | model_file | rmsd | rmsd_err | tm_score | tm_score_err | top_template_chain | top_template_pdb |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Rv0417 | 1.66 | easy | P9WG73 | 2015-12-30 | P9WG73_model1.pdb | 2.6 | 1.9 | 0.95 | 0.05 | C | 2htm |
| Rv2291 | 1.38 | easy | P9WHF5 | 2016-01-04 | P9WHF5_model1.pdb | 3.3 | 2.3 | 0.91 | 0.06 | A | 3olh |
| Rv1559 | 0.73 | easy | P9WG95 | 2016-01-08 | P9WG95_model1.pdb | 5.4 | 3.4 | 0.81 | 0.09 | A | 1tdj |
| Rv3113 | 0.72 | easy | O05790 | 2015-12-30 | O05790_model1.pdb | 4.1 | 2.8 | 0.81 | 0.09 | A | 3sd7 |
| Rv2447c | 0.07 | easy | I6Y0R5 | 2016-01-08 | I6Y0R5_model1.pdb | 7.1 | 4.2 | 0.72 | 0.11 | A | 2vos |


In [16]:
# Manually specify homology models (empty in this example)
homology_model_dict = {}
my_gempro.manual_homology_models(homology_model_dict)

[2017-03-08 13:18] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.





## Downloading and ranking structures

### Methods

In [None]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)


[2017-03-08 13:39] [ssbio.pipeline.gempro] INFO: Updated PDB metadata dataframe. See the "df_pdb_metadata" attribute for a summary dataframe.
[2017-03-08 13:39] [ssbio.pipeline.gempro] INFO: Saved 942 structures total





| gene | chemicals | date | description | experimental_method | mapped_chains | pdb_id | pdb_title | resolution | structure_file | taxonomy_name |
|---|---|---|---|---|---|---|---|---|---|---|
| Rv1295 | PLP | 2006-09-05;2009-02-24;2009-04-28;2011-07-13 | Threonine synthase (E.C.4.2.3.1) | X-ray diffraction | A;B | 2d1f | Structure of Mycobacterium tuberculosis threon... | 2.5 | 2d1f.cif | Mycobacterium tuberculosis |
| Rv1201c | SCA;MPD;MG;NA;ACY | 2009-06-23;2011-07-13 | Tetrahydrodipicolinate N-succinyltransferase (... | X-ray diffraction | A;B;C;D;E | 3fsy | Structure of tetrahydrodipicolinate N-succinyl... | 1.97 | 3fsy.cif | Mycobacterium tuberculosis |


In [None]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()



In [None]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()

## Creating homology models

For proteins with no representative structure, we can create homology models. `ssbio` contains some built-in functions for easily running [I-TASSER](http://zhanglab.ccmb.med.umich.edu/I-TASSER/download/) locally or on machines with `SLURM` (e.g., on NERSC) or `Torque` job scheduling.

Once the I-TASSER runs complete, you can load the resulting models using the `get_itasser_models` method.

<p><div class="alert alert-info">**Info:** Homology modeling can take a long time - about 24-72 hours per protein (highly dependent on the sequence length, as well as if there are available templates).</div></p>

### Methods

In [None]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_models('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)

## Saving your GEM-PRO

<p><div class="alert alert-warning">**Warning:** Saving is still experimental. For a full GEM-PRO with sequences & structures, depending on the number of genes, saving can take >5 minutes.</div></p>

In [None]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)