# README

This notebook gives an example of how to run the GEM-PRO pipeline on an **SBML model**.

## Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, first make sure your copy of the repository is up to date (git pull)

## Quick start

I just want to get structures ASAP! How can I do that?

- my_gempro.run_pipeline()

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

  warn("No LP solvers found")


In [2]:
# Other imports for this example
import os
import pandas as pd
import os.path as op

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Logging

Set the logging level to control how verbose the pipeline is. The levels below are listed from least to most verbose; DEBUG prints the most.

- CRITICAL
     - Only really important messages shown
- ERROR
     - Major errors
- WARNING
     - Warnings that don't affect running of the pipeline
- INFO
     - Info such as the number of structures mapped per gene
- DEBUG
     - Really detailed information that will print out a lot of stuff

In [4]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
logger.setLevel(logging.INFO)
#######################################################
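
As a rough illustration of how the level filters messages, the sketch below uses only Python's standard logging module. The logger name "gempro_demo" and the messages are made up for this example and are not emitted by ssbio.

```python
import io
import logging

# Hypothetical logger, used only for this illustration
demo_logger = logging.getLogger("gempro_demo")
demo_logger.propagate = False  # keep the demo separate from the root logger

# Capture the output in memory so it can be inspected afterwards
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
demo_logger.addHandler(handler)

# At INFO, DEBUG messages are dropped; INFO and above are kept
demo_logger.setLevel(logging.INFO)
demo_logger.debug("very detailed message")
demo_logger.info("number of structures mapped per gene")
demo_logger.warning("something to be aware of")

captured = stream.getvalue().splitlines()
print(captured)
```

Switching the level to logging.DEBUG would let the first message through as well.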

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES
    - Your list of gene IDs (the example below instead reads them from a model file, GEM_FILE)

A directory will be created in ROOT_DIR named after your GEM_NAME. The folders are:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   ├── <gene_id2>
    │   └── ...
    └── structures
        ├── by_gene  # structures are stored here, in gene specific folders
        │   ├── <gene_id1>
        │   │   └── 1abc.pdb
        │   ├── <gene_id2>
        │   └── ...
        └── by_complex  # complexes for reactions are stored here (in progress)
```
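
GEMPRO creates these folders for you. Purely to make the layout concrete, here is a small sketch that builds the same tree in a temporary directory; the helper make_project_tree is hypothetical and is not part of ssbio.

```python
import os
import os.path as op
import tempfile

# Sub-folders taken from the tree above
SUBFOLDERS = [
    "data",
    "figures",
    "model",
    "notebooks",
    "sequences",
    op.join("structures", "by_gene"),
    op.join("structures", "by_complex"),
]

def make_project_tree(root_dir, gem_name):
    """Create <root_dir>/<gem_name> with the sub-folders listed above."""
    base_dir = op.join(root_dir, gem_name)
    for sub in SUBFOLDERS:
        os.makedirs(op.join(base_dir, sub), exist_ok=True)
    return base_dir

# Use a temporary directory so the sketch does not touch a real project
with tempfile.TemporaryDirectory() as tmp_root:
    base = make_project_tree(tmp_root, "my_project")
    created = sorted(os.listdir(base))
    structure_dirs = sorted(os.listdir(op.join(base, "structures")))

print(created)
print(structure_dirs)
```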

In [5]:
# Specify the folders and model
GEM_NAME = 'mtuberculosis_gp_atlas'
ROOT_DIR = '/home/nathan/projects_unsynced/'
GEM_FILE = '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/iNJ661.json'

In [6]:
# Create the GEM-PRO project
# Specify that the model is in json format
my_gempro = GEMPRO(gem_name=GEM_NAME, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type='json')

INFO:ssbio.pipeline.gempro:Loaded model: /home/nathan/projects_unsynced/mtuberculosis_gp_atlas/model/iNJ661.json
INFO:ssbio.pipeline.gempro:Number of reactions: 1025
INFO:ssbio.pipeline.gempro:Number of reactions linked to a gene: 720
INFO:ssbio.pipeline.gempro:Number of genes (excluding spontaneous): 661
INFO:ssbio.pipeline.gempro:Number of metabolites: 826


## Mapping gene ID to sequence

**First, we want to set a sequence for each of the genes in our model.** This can happen in one of two ways:

1. I want to map the gene IDs in the model (I don't have the original genome sequence)
2. I have the sequences for my organism

### 1. I want to map the gene IDs in the model

##### Mapping with KEGG

- Files created:
    - Per gene, the KEGG sequence and metadata are downloaded to the GEM-PRO "sequences" folder.
    

- Usage:
        my_gempro.kegg_mapping_and_metadata(kegg_organism_code, custom_gene_mapping=None, force_rerun=False)


- Arguments:

    - *kegg_organism_code*
        - See the full list of organisms here: http://www.genome.jp/kegg/catalog/org_list.html
        - E. coli MG1655 is "eco"
        - M. tuberculosis is "mtu"

    - *custom_gene_mapping*
        - If the model gene IDs differ from the KEGG gene IDs, and you know the mapping, supply it as a dictionary here. 
        - An example would be for the *T. maritima* model, where the model IDs are formatted without an underscore ("TM0001") while in KEGG they have an underscore ("TM_0001").

    - *force_rerun*
        - If you want to force the rerun of mapping, set this to True.


- Creates attributes:
    - A summary of the metadata is available in the "df_kegg_metadata" attribute.
            my_gempro.df_kegg_metadata
            
    - Any gene IDs that are missing a mapping are reported in the "missing_kegg_mapping" attribute.
            my_gempro.missing_kegg_mapping
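
For the *T. maritima* case described above, the custom mapping could be generated rather than typed by hand. This is only a sketch: the helper add_locus_underscore and the sample IDs are made up, and it assumes the dictionary is keyed by model gene ID with the KEGG gene ID as the value.

```python
import re

def add_locus_underscore(gene_id):
    """Turn 'TM0001' into 'TM_0001' (letters followed by digits)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", gene_id)
    if match is None:
        return gene_id  # leave IDs in any other format unchanged
    return "{}_{}".format(match.group(1), match.group(2))

# Made-up sample of model gene IDs
model_gene_ids = ["TM0001", "TM0002", "TM1867"]

# Assumed direction: model gene ID -> KEGG gene ID
custom_gene_mapping = {g: add_locus_underscore(g) for g in model_gene_ids}
print(custom_gene_mapping)
```

The resulting dictionary would then be passed as the custom_gene_mapping argument.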

##### Mapping with UniProt

- Method:
    - You can try mapping your genes using the actual service here: http://www.uniprot.org/uploadlists/
    
    
- Files created:
    - Per gene, the sequence and metadata are downloaded to the GEM-PRO "sequences" folder.
    

- Usage:
        my_gempro.uniprot_mapping_and_metadata(model_gene_source, custom_gene_mapping=None, force_rerun=False)
        
        
- Arguments:

    - *model_gene_source*
        - Here is a list of the gene IDs that can be mapped to UniProt IDs: http://www.uniprot.org/help/programmatic_access#id_mapping_examples
        - *E. coli* b-numbers are of the source "ENSEMBLGENOME_ID"
        - *M. tuberculosis* gene IDs match the source "TUBERCULIST_ID"
        
    - *custom_gene_mapping*
        - If you know the model gene ID to UniProt ID mapping, supply it as a dictionary here.
        
    - *force_rerun*
        - If you want to force the rerun of mapping, set this to True.
        
        
- Creates attributes:
    - A summary of the metadata is available in the "df_uniprot_metadata" attribute.
            my_gempro.df_uniprot_metadata
            
    - Any gene IDs that are missing a mapping are reported in the "missing_uniprot_mapping" attribute.
            my_gempro.missing_uniprot_mapping
        

##### Consolidating information

If you have mapped with both the KEGG and UniProt mappers, you can set a representative sequence for each gene using this function.

- Method:
    - Manual mappings override all existing mappings.
    - UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn't.


- Usage:
        my_gempro.set_representative_sequence()


- Creates attributes:
    - A summary of the mappings is available in the "df_representative_sequences" attribute.
            my_gempro.df_representative_sequences

    - Any gene IDs that are missing a representative sequence are reported in the "missing_repseq" attribute.
            my_gempro.missing_repseq
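
The precedence rules listed under "Method" can be written out as a small function. This is an illustration of the stated rules only, not ssbio's implementation; the function name and the record layout (a dict with a "pdbs" key) are assumptions.

```python
def pick_representative(manual=None, uniprot=None, kegg=None):
    """Return (source, record) for one gene following the rules listed above.

    Each argument is None or a dict; a 'pdbs' key lists mapped PDB IDs.
    """
    if manual is not None:
        return "manual", manual            # manual mappings always win
    if uniprot is not None and kegg is not None:
        kegg_has_pdbs = bool(kegg.get("pdbs"))
        uniprot_has_pdbs = bool(uniprot.get("pdbs"))
        if kegg_has_pdbs and not uniprot_has_pdbs:
            return "kegg", kegg            # the one exception to UniProt > KEGG
        return "uniprot", uniprot
    if uniprot is not None:
        return "uniprot", uniprot
    if kegg is not None:
        return "kegg", kegg
    return None, None                      # gene would be reported as missing

# KEGG knows a PDB for this gene, UniProt does not: KEGG is preferred
kegg_rec = {"id": "mtu:Rv1295", "pdbs": ["2D1F"]}
uniprot_rec = {"id": "P9WG59", "pdbs": None}
print(pick_representative(uniprot=uniprot_rec, kegg=kegg_rec)[0])
```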
        

---------------------------

### 2. I have the sequences for my organism

If you already have the amino acid sequence for each gene, simply set them as the representative sequences using the function below.

##### Setting the amino acid sequences

- Method:
    - Use this when you have the protein-coding sequences from the genome itself, or when you want to work with an arbitrary collection of sequences.
    - These sequences are set as "representative" for each gene, so no KEGG or UniProt ID is set as representative. If you later run the KEGG or UniProt mappings above, they check whether the mapped sequence matches yours and, if so, add the metadata.

- Usage:
        my_gempro.manual_seq_mapping(gene_to_seq_dict)
        
        
- Arguments:

    - *gene_to_seq_dict*
        - Supply a dictionary mapping the model gene IDs to their amino acid sequences.
        - Example: 
                {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSK'}
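
If your sequences live in a FASTA file, the dictionary can be built with a few lines of plain Python. The parser below is a minimal sketch, not an ssbio function, and it assumes each FASTA header starts with the model gene ID.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into {record_id: sequence}."""
    sequences = {}
    current_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID
            current_id = line[1:].split()[0]
            sequences[current_id] = []
        elif current_id is not None:
            sequences[current_id].append(line)
    return {gene: "".join(parts) for gene, parts in sequences.items()}

# Small in-memory example; for a real file, pass the open file object instead
fasta_text = """>Rv1295 truncated example
MTVPPTATHQPWPGVIAAYRDRLPV
GDDWTPVTLLEGGTPLIAATNLSK
"""
gene_to_seq_dict = read_fasta(fasta_text.splitlines())
print(gene_to_seq_dict)
```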

-------------------

### Gene annotations

Check out the gene annotations saved directly into the gene. 

- **These are COBRApy Gene objects, transformed into GenePro objects. The only change is an added "protein" attribute. If you started with an SBML model, these are directly annotated within the model.**

- Usage:
    - Work with the protein attribute
            my_gempro.genes.get_by_id('Rv1295').protein
            my_gempro.genes.get_by_id('Rv1295').protein.sequences
    - Display all attributes of the first sequence added
            my_gempro.genes.get_by_id('Rv1295').protein.sequences[0].get_dict()

In [7]:
# Example of manual_seq_mapping
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)

INFO:ssbio.pipeline.gempro:Loaded in 1 sequences


In [8]:
my_gempro.genes.get_by_id('Rv1295').protein.sequences

[<SeqProp Rv1295 at 0x7fbae511bf60>]

In [9]:
# manual_seq_mapping will automatically set that sequence as the representative sequence
my_gempro.genes.get_by_id('Rv1295').protein.representative_sequence.get_dict()

{'bigg': None,
 'description': None,
 'gene_name': None,
 'id': 'Rv1295',
 'kegg': None,
 'metadata_file': None,
 'metadata_path': None,
 'pdbs': None,
 'refseq': None,
 'seq_record': None,
 'seq_str': None,
 'sequence_alignments': [],
 'sequence_file': 'Rv1295.faa',
 'sequence_len': 360,
 'sequence_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv1295/Rv1295.faa',
 'structure_alignments': [],
 'uniprot': None}

In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
my_gempro.df_kegg_metadata.head(2)
my_gempro.missing_kegg_mapping

INFO:ssbio.pipeline.gempro:655 genes mapped to KEGG
INFO:ssbio.pipeline.gempro:Created KEGG metadata dataframe. See the "df_kegg_metadata" attribute.





| | gene | kegg | refseq | uniprot | pdbs | sequence_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|
| 0 | Rv0417 | mtu:Rv0417 | NP_214931 | P9WG73 | | 252 | mtu-Rv0417.faa | mtu-Rv0417.kegg |
| 1 | Rv2291 | mtu:Rv2291 | NP_216807 | P9WHF5 | | 284 | mtu-Rv2291.faa | mtu-Rv2291.kegg |


['Rv2321c', 'Rv0619', 'Rv0618', 'Rv2322c', 'Rv2233', 'Rv1755c']

In [11]:
# Looking at stored sequences
my_gempro.genes.get_by_id('Rv1295').protein.sequences

[<SeqProp Rv1295 at 0x7fbae511bf60>, <KEGGProp mtu:Rv1295 at 0x7fbae4de8be0>]

In [12]:
# If the automatic mapping maps to a sequence that is already set as a representative_sequence, the attributes are updated
my_gempro.genes.get_by_id('Rv1295').protein.representative_sequence.get_dict()

{'bigg': None,
 'description': None,
 'gene_name': None,
 'id': 'Rv1295',
 'kegg': 'mtu:Rv1295',
 'metadata_file': 'mtu-Rv1295.kegg',
 'metadata_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv1295/mtu-Rv1295.kegg',
 'pdbs': ['2D1F'],
 'refseq': 'NP_215811',
 'seq_record': None,
 'seq_str': None,
 'sequence_alignments': [],
 'sequence_file': 'Rv1295.faa',
 'sequence_len': 360,
 'sequence_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv1295/Rv1295.faa',
 'structure_alignments': [],
 'taxonomy': 'mtu  Mycobacterium tuberculosis H37Rv',
 'uniprot': 'P9WG59'}
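
Comparing the dictionary above with the one from cell 9 suggests that fields which were empty on the representative sequence are filled in from the matching KEGG record, while fields that were already set (such as 'id' and 'sequence_file') are kept. A minimal sketch of that behaviour, assuming a simple fill-if-empty rule; the helper add_metadata is hypothetical and is not part of ssbio:

```python
def add_metadata(representative, mapped, sequences_match):
    """Fill empty fields of the representative record from a mapped record.

    Nothing is copied unless the two sequences were found to match.
    """
    merged = dict(representative)
    if not sequences_match:
        return merged
    for key, value in mapped.items():
        if merged.get(key) is None:
            merged[key] = value   # only fill fields that are still empty
    return merged

# Values copied from the outputs shown in this notebook
representative = {"id": "Rv1295", "kegg": None, "pdbs": None,
                  "sequence_file": "Rv1295.faa", "uniprot": None}
kegg_record = {"id": "mtu:Rv1295", "kegg": "mtu:Rv1295", "pdbs": ["2D1F"],
               "refseq": "NP_215811", "uniprot": "P9WG59"}

updated = add_metadata(representative, kegg_record, sequences_match=True)
print(updated)
```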

In [13]:
# UniProt mapping of gene_id -> uniprot_id
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
my_gempro.df_uniprot_metadata.head(2)
my_gempro.missing_uniprot_mapping[:10]

INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.11.1
INFO:root:getUserAgent: End
INFO:ssbio.pipeline.gempro:Created UniProt metadata dataframe. See the "df_uniprot_metadata" attribute.





| | gene | uniprot | reviewed | gene_name | kegg | refseq | pdbs | ec_number | pfam | sequence_len | description | entry_version | seq_version | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rv0417 | P9WG73 | True | thiG | mtu:Rv0417 | NP_214931.1;NC_000962.3;WP_003916659.1;NZ_KK33... | | 2.8.1.10 | | 252 | Thiazole synthase {ECO:0000255\|HAMAP-Rule:MF_0... | 2016-11-02 | 2014-04-16 | P9WG73.fasta | P9WG73.txt |
| 1 | Rv2291 | P9WHF5 | True | sseB | mtu:Rv2291 | NP_216807.1;NC_000962.3;WP_003899253.1;NZ_KK33... | | 2.8.1.1 | PF00581 | 284 | Putative thiosulfate sulfurtransferase SseB | 2016-11-02 | 2014-04-16 | P9WHF5.fasta | P9WHF5.txt |


['Rv3113',
 'Rv0156',
 'Rv3281',
 'Rv1511',
 'Rv1005c',
 'Rv3468c',
 'Rv3565',
 'Rv1915',
 'Rv2471',
 'Rv2379c']

In [14]:
my_gempro.genes.get_by_id('Rv1295').protein.sequences

[<SeqProp Rv1295 at 0x7fbae511bf60>,
 <KEGGProp mtu:Rv1295 at 0x7fbae4de8be0>,
 <UniProtProp P9WG59 at 0x7fbae4d872b0>]

In [15]:
my_gempro.genes.get_by_id('Rv1295').protein.representative_sequence.get_dict()

{'bigg': None,
 'description': ['TS', 'Threonine synthase'],
 'ec_number': ['4.2.3.1'],
 'entry_version': '2016-11-02',
 'gene_name': 'thrC',
 'id': 'Rv1295',
 'kegg': 'mtu:Rv1295',
 'metadata_file': 'mtu-Rv1295.kegg',
 'metadata_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv1295/mtu-Rv1295.kegg',
 'pdbs': ['2D1F'],
 'pfam': ['PF00291'],
 'refseq': 'NP_215811',
 'reviewed': True,
 'seq_record': None,
 'seq_str': None,
 'seq_version': '2014-04-16',
 'sequence_alignments': [],
 'sequence_file': 'Rv1295.faa',
 'sequence_len': 360,
 'sequence_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv1295/Rv1295.faa',
 'structure_alignments': [],
 'taxonomy': 'mtu  Mycobacterium tuberculosis H37Rv',
 'uniprot': 'P9WG59'}

In [16]:
# Manually adding in UniProt mappings later
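# The input file has two columns: model gene ID, then UniProt accession
manual_uniprot = pd.read_csv(op.join(my_gempro.data_dir, '161019-gene_to_uniprot.in'))
# Build {gene_id: uniprot_id} from the first two columns, selected by position
manual_uniprot_dict = dict(zip(manual_uniprot.iloc[:, 0], manual_uniprot.iloc[:, 1]))
print(manual_uniprot_dict)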

my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)

INFO:ssbio.pipeline.gempro:Updated existing UniProt dataframe.


{'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890', 'Rv1755c': 'P9WIA9'}


| | gene | uniprot | reviewed | gene_name | kegg | refseq | pdbs | ec_number | pfam | sequence_len | description | entry_version | seq_version | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 657 | Rv0619 | Q79FY3 | False | galTb | | | | | PF02744 | 181 | Probable galactose-1-phosphate uridylyltransfe... | 2016-11-02 | 2004-07-05 | Q79FY3.fasta | Q79FY3.txt |
| 658 | Rv0618 | Q79FY4 | False | galTa | mtv:RVBD_0618 | | | | PF01087 | 231 | Probable galactose-1-phosphate uridylyltransfe... | 2016-11-30 | 2004-07-05 | Q79FY4.fasta | Q79FY4.txt |
| 659 | Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1;NZ_KK339370.1 | | | PF00202 | 221 | Probable ornithine aminotransferase (N-terminu... | 2016-11-30 | 1997-02-01 | P71890.fasta | P71890.txt |
| 660 | Rv1755c | P9WIA9 | True | plcD | | | | 3.1.4.3 | PF04185 | 514 | Phospholipase C 4 | 2016-11-02 | 2014-04-16 | P9WIA9.fasta | P9WIA9.txt |


In [17]:
my_gempro.genes.get_by_id('Rv2321c').protein.representative_sequence.get_dict()

{'bigg': None,
 'description': None,
 'gene_name': None,
 'id': 'P71891',
 'kegg': ['mtv:RVBD_2321c'],
 'metadata_file': 'P71891.txt',
 'metadata_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv2321c/P71891.txt',
 'pdbs': None,
 'refseq': None,
 'seq_record': None,
 'seq_str': None,
 'sequence_alignments': [],
 'sequence_file': 'P71891.fasta',
 'sequence_len': 181,
 'sequence_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv2321c/P71891.fasta',
 'structure_alignments': [],
 'uniprot': 'P71891'}

In [18]:
# Consolidate mappings sources
my_gempro.set_representative_sequence()
my_gempro.df_representative_sequences.head(2)
my_gempro.missing_repseq

INFO:ssbio.pipeline.gempro:Created sequence mapping dataframe. See the "df_representative_sequences" attribute.


| | gene | uniprot | kegg | pdbs | sequence_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|
| 0 | Rv0417 | P9WG73 | mtu:Rv0417 | | 252 | P9WG73.fasta | P9WG73.txt |
| 1 | Rv2291 | P9WHF5 | mtu:Rv2291 | | 284 | P9WHF5.fasta | P9WHF5.txt |


['Rv2233']

In [19]:
# Looking at sequence information saved per gene
my_gempro.genes.get_by_id('Rv2589').protein.representative_sequence.get_dict()

{'bigg': None,
 'description': None,
 'gene_name': None,
 'id': 'P9WQ79',
 'kegg': ['mtu:Rv2589'],
 'metadata_file': 'P9WQ79.txt',
 'metadata_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv2589/P9WQ79.txt',
 'pdbs': None,
 'refseq': None,
 'seq_record': None,
 'seq_str': None,
 'sequence_alignments': [],
 'sequence_file': 'P9WQ79.fasta',
 'sequence_len': 449,
 'sequence_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/sequences/Rv2589/P9WQ79.fasta',
 'structure_alignments': [],
 'uniprot': 'P9WQ79'}

## Mapping representative sequence to structure

These are the ways to map sequence to structure:

1. Use the UniProt ID and their automatic mappings to the PDB
2. BLAST the sequence to the PDB
3. Make homology models
4. Map to existing homology models

You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you'll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
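
To make that constraint explicit, the sketch below lists which routes are open to a gene based on its representative sequence. The function and the route labels are hypothetical; the input is a dict shaped like the get_dict() output shown earlier in this notebook.

```python
def structure_mapping_options(repseq):
    """List the mapping routes open to a gene, given its representative sequence."""
    options = []
    if repseq.get("uniprot"):
        options.append("uniprot_to_pdb")   # option 1 needs a UniProt ID
    options.append("blast_to_pdb")         # option 2 only needs the sequence
    options.append("homology_model")       # options 3 and 4
    return options

with_uniprot = {"id": "P9WQ79", "uniprot": "P9WQ79"}
without_uniprot = {"id": "Rv1295", "uniprot": None}

print(structure_mapping_options(with_uniprot))
print(structure_mapping_options(without_uniprot))
```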

---------------------

### 1. Use the UniProt ID and their automatic mappings to the PDB


- Method:
    - Use the PDBe REST service to query for the best PDB structures for a UniProt ID.
    - Here is the ranking algorithm described by the PDB paper: https://nar.oxfordjournals.org/content/44/D1/D385.full
    - More information found here: https://www.ebi.ac.uk/pdbe/api/doc/sifts.html
    - Link used to retrieve results: https://www.ebi.ac.uk/pdbe/api/mappings/best_structures/:accession
    - The service returns the list of PDB structures that map to a UniProt accession, sorted by coverage of the protein and, where coverage is equal, by resolution.


- Files created:
    - Saves a .json file directly from the web request in the GEM-PRO "sequences" folder
    - No PDBs are downloaded yet
    
    
- Usage:
        my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=0, force_rerun=False)
        
        
- Arguments:

    - *seq_ident_cutoff*
        - From 0 to 1
        - Provide the seq_ident_cutoff as a fraction (for example, 0.95 for 95%) to keep only structures whose percent identity is above the cutoff.
        - **Warning:** if you set the seq_ident_cutoff too high you risk filtering out PDBs that do match the sequence, but are just missing large portions of it.
        
    - *force_rerun*
        - If you want to force the rerun of mapping, set this to True.
        
        
- Creates attributes:

    - A summary of the rankings is available in the "df_pdb_ranking" attribute.
            my_gempro.df_pdb_ranking
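
The ordering described under "Method" (coverage first, resolution as the tie-break) can be sketched as a plain sort. This only illustrates the stated ordering and is not the PDBe or ssbio code: it assumes higher coverage and lower resolution values are better, ignores entries without a resolution, and uses placeholder IDs ("xxx1" and so on) that are not real PDB entries.

```python
def rank_structures(records):
    """Sort by coverage (highest first), then resolution (lowest first)."""
    ordered = sorted(records, key=lambda r: (-r["coverage"], r["resolution"]))
    # Attach a 1-based rank to a copy of each record
    return [dict(r, rank=i) for i, r in enumerate(ordered, start=1)]

# Placeholder records for a single UniProt accession
records = [
    {"pdb_id": "xxx1", "coverage": 1.0, "resolution": 3.1},
    {"pdb_id": "xxx2", "coverage": 0.997, "resolution": 1.97},
    {"pdb_id": "xxx3", "coverage": 1.0, "resolution": 2.5},
]

ranked = rank_structures(records)
print([(r["rank"], r["pdb_id"]) for r in ranked])
```

Note that "xxx2" has the best resolution but ranks last, because coverage is compared first.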
    
        

---------------------------

### 2. BLAST the sequence to the PDB

This will BLAST the representative sequence against the entire PDB and return significant hits. Note, however, that this also returns hits from other organisms, which may not be ideal.


- Method:
    - BLAST the representative sequence to the entire PDB.


- Files created:
    - Saves an .xml file directly from the web request in the GEM-PRO "sequences" folder
    - No PDBs are downloaded yet
    
    
- Usage:
        my_gempro.blast_seqs_to_pdb(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, force_rerun=False, display_link=False)
        
        
- Arguments:

    - *seq_ident_cutoff*
        - From 0 to 1
        - Provide the seq_ident_cutoff as a fraction (for example, 0.95 for 95%) to keep only structures whose percent identity is above the cutoff.
        - **Warning:** if you set the seq_ident_cutoff too high you risk filtering out PDBs that do match the sequence, but are just missing large portions of it.
        
    - *evalue*
        - Cutoff for the E-value - filters for significant hits.
        - 0.001 is liberal, 0.0001 is stringent (default).
        
    - *all_genes*
        - Set to True if you want to BLAST all gene sequences
        - Set to False if you only want to BLAST genes without any PDBs (if map_uniprot_to_pdb was already run)
        
    - *force_rerun*
        - If you want to force the rerun of mapping, set this to True.
        
    - *display_link*
        - Display a link to the HTML results of the BLAST result per gene.
        
        
- Creates attributes:

    - A summary of the rankings is available in the "df_pdb_blast" attribute.
            my_gempro.df_pdb_blast
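
The sketch below shows how the two cutoffs interact and why a high seq_ident_cutoff can discard good but partial structures. It is an illustration under stated assumptions, not ssbio's code: it assumes percent identity is computed against the full query length, that the comparisons are inclusive, and it uses placeholder PDB IDs.

```python
def percent_identity(num_identical, query_length):
    """Identity as a fraction of the full query sequence (an assumption)."""
    return num_identical / query_length

def filter_hits(hits, seq_ident_cutoff=0, evalue=0.0001):
    """Keep hits at or above the identity cutoff and at or below the E-value."""
    return [
        h for h in hits
        if h["hit_percent_ident"] >= seq_ident_cutoff and h["hit_evalue"] <= evalue
    ]

# A made-up structure that matches perfectly but covers only half of a
# 360-residue query scores 0.5 and is removed by a 0.95 cutoff
half_length = percent_identity(num_identical=180, query_length=360)

hits = [
    {"pdb_id": "xxx1", "hit_percent_ident": 0.97, "hit_evalue": 0.0},
    {"pdb_id": "xxx2", "hit_percent_ident": half_length, "hit_evalue": 1e-90},
    {"pdb_id": "xxx3", "hit_percent_ident": 0.99, "hit_evalue": 0.01},
]

kept = [h["pdb_id"] for h in filter_hits(hits, seq_ident_cutoff=0.95)]
print(half_length, kept)
```

With the default cutoff of 0, "xxx2" would be kept as well; "xxx3" is dropped in both cases because its E-value is above the threshold.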

---------------------------

### 3. Make homology models

This will prepare sequences for homology modeling using I-TASSER or allow you to organize already generated ones.

- Method:
    - Prepare representative sequences for I-TASSER runs on your local machine, or using Torque (qsub, available on ssb0-ssb4) or SLURM (sbatch, currently used on NERSC) job scheduler systems.


- Files created:
    - Creates homology modeling files in a specified directory
    
    
- Usage:
        my_gempro.prep_itasser_models(outdir, itasser_installation, itlib_location, runtype, print_exec=False, **kwargs)
        
        
- Arguments:

    - *outdir*
        - Directory in which the homology modeling files are created
        
    - *itasser_installation*
        - text
        
    - *itlib_location*
        - text
        
    - *runtype*
        - text
        
    - *print_exec*
        - text
        
    - *kwargs*
        - text

-------------------

### 4. Map to existing homology models (I-TASSER)

This will organize the homology models generated by I-TASSER.

- Method:
    - text


- Files created:
    - Copies homology models and a couple of summary result files into the GEM-PRO "structures" directory
    
    
- Usage:
        my_gempro.get_itasser_models(homology_raw_dir, custom_itasser_name_mapping=None, force_rerun=False)
        
        
- Arguments:

    - *homology_raw_dir*
        - 
        
    - *custom_itasser_name_mapping*
        - 
        
- Creates attributes:

    - A summary of the I-TASSER modeling is available in the "df_itasser" attribute.
            my_gempro.df_itasser
            

### 4. Map to existing homology models (Generic)

This will map genes to existing homology models from any source.

- Method:
    - text


- Files created:
    - Copies homology models into the GEM-PRO "structures" directory
    
    
- Usage:
        my_gempro.manual_homology_models(input_dict)
        
        
- Arguments:

    - *input_dict*
        - A dictionary of dictionaries mapping model gene IDs to homology model IDs and their information, in the form:
                {model_gene: {homology_model_id1: {'model_file': '/path/to/homology/model',
                                                  'other_info': 'other_info_here',
                                                  ...},
                              homology_model_id2: {'model_file': '/path/to/homology/model',
                                                  'other_info': 'other_info_here',
                                                  ...}}}
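
One possible way to assemble that nested dictionary from a list of model files per gene is sketched below. The helper build_homology_input, the paths and the "source" key are made up for illustration, and using the file name (without extension) as the homology model ID is an assumption.

```python
import os.path as op

def build_homology_input(model_files, **other_info):
    """Build the nested dict from {gene_id: [model file paths]}.

    The file name without its extension is used as the homology model ID.
    """
    input_dict = {}
    for gene_id, paths in model_files.items():
        input_dict[gene_id] = {}
        for path in paths:
            model_id = op.splitext(op.basename(path))[0]
            entry = {"model_file": path}
            entry.update(other_info)   # extra key/value pairs stored per model
            input_dict[gene_id][model_id] = entry
    return input_dict

# Made-up paths, for illustration only
model_files = {
    "Rv0417": ["/path/to/models/Rv0417_model1.pdb"],
    "Rv2291": ["/path/to/models/Rv2291_model1.pdb",
               "/path/to/models/Rv2291_model2.pdb"],
}

input_dict = build_homology_input(model_files, source="in-house")
print(input_dict["Rv0417"])
```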
            

-------------------

### Gene annotations

Check out the gene annotations saved directly into the gene.

- Usage:
        my_gempro.genes.get_by_id('Rv1295').protein.structures


In [20]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb()
my_gempro.df_pdb_ranking.head()

INFO:root:getUserAgent: Begin
INFO:root:getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.5.2; Linux) Python-requests/2.11.1
INFO:root:getUserAgent: End
INFO:ssbio.pipeline.gempro:Completed UniProt -> best PDB mapping. See the "df_pdb_ranking" attribute.
INFO:ssbio.pipeline.gempro:178: number of genes with at least one structure
INFO:ssbio.pipeline.gempro:482: number of genes with no structures





| | gene | uniprot | pdb_id | pdb_chain_id | experimental_method | resolution | coverage | tax_id | start | end | unp_start | unp_end | rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rv1295 | P9WG59 | 2d1f | A | X-ray diffraction | 2.5 | 1.0 | 1773 | 1 | 360 | 1 | 360 | 1 |
| 1 | Rv1295 | P9WG59 | 2d1f | B | X-ray diffraction | 2.5 | 1.0 | 1773 | 1 | 360 | 1 | 360 | 2 |
| 2 | Rv1201c | P9WP21 | 3fsy | A | X-ray diffraction | 1.97 | 0.997 | 83332 | 4 | 319 | 2 | 317 | 1 |
| 3 | Rv1201c | P9WP21 | 3fsy | B | X-ray diffraction | 1.97 | 0.997 | 83332 | 4 | 319 | 2 | 317 | 2 |
| 4 | Rv1201c | P9WP21 | 3fsy | C | X-ray diffraction | 1.97 | 0.997 | 83332 | 4 | 319 | 2 | 317 | 3 |


In [21]:
my_gempro.genes.get_by_id('Rv1295').protein.structures

[<PDBProp 2d1f at 0x7fbae4d95438>]

In [22]:
my_gempro.genes.get_by_id('Rv1295').protein.structures[0].get_dict()

{'chains': [<ChainProp A at 0x7fbae4d51908>, <ChainProp B at 0x7fbae4d51cc0>],
 'date': None,
 'description': None,
 'experimental_method': 'X-ray diffraction',
 'file_type': None,
 'id': '2d1f',
 'is_experimental': True,
 'mapped_chains': ['A', 'B'],
 'reference_seq': <SeqProp Rv1295 at 0x7fbae4d51128>,
 'reference_seq_top_coverage': 0,
 'release_date': None,
 'representative_chain': None,
 'resolution': 2.5,
 'structure': None,
 'structure_file': None,
 'structure_path': None,
 'tax_id': 1773}

In [23]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(seq_ident_cutoff=.95, all_genes=True)
my_gempro.df_pdb_blast.head()

INFO:ssbio.pipeline.gempro:Rv1908c: Adding 2 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv3307: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2965c: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv1603: Adding 2 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2247: Adding 2 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv0211: Adding 8 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv1392: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv0467: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv1415: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2220: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2043c: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2245: Adding 9 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv0642c: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv0126: Adding 1 PDBs from BLAST results.
INFO:ssbio.pipeline.gempro:Rv2




| | gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Rv1908c | 4c50 | A | 3652.0 | 0.0 | 0.974324 | 0.974324 | 721 | 721 |
| 1 | Rv1908c | 4c50 | B | 3652.0 | 0.0 | 0.974324 | 0.974324 | 721 | 721 |
| 2 | Rv1908c | 4c51 | A | 3651.0 | 0.0 | 0.974324 | 0.974324 | 721 | 721 |
| 3 | Rv1908c | 4c51 | B | 3651.0 | 0.0 | 0.974324 | 0.974324 | 721 | 721 |
| 4 | Rv3307 | 1n3i | A | 1100.0 | 5.3680500000000006e-120 | 0.955224 | 0.955224 | 256 | 256 |


In [24]:
# Organizing homology models (only specific to this example)
old_gene_to_homology = pd.read_csv(op.join(my_gempro.data_dir, '161031-old_gene_to_uniprot_mapping.csv'))
# Build {model gene ID: UniProt accession} from the two relevant columns
gene_to_uniprot = old_gene_to_homology.set_index('m_gene')['u_uniprot_acc'].to_dict()

my_gempro.get_itasser_models(homology_raw_dir='/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/raw', custom_itasser_name_mapping=gene_to_uniprot)
my_gempro.df_itasser.head()

INFO:ssbio.pipeline.gempro:Completed copying of 428 I-TASSER models to GEM-PRO directory
INFO:ssbio.pipeline.gempro:428 I-TASSER models total
INFO:ssbio.pipeline.gempro:See the "df_itasser" attribute for a summary dataframe





| | gene | model_file | model_date | difficulty | top_template_pdb | top_template_chain | c_score | tm_score | tm_score_err | rmsd | rmsd_err |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Rv0417 | Rv0417_model1.pdb | 2015-12-30 | easy | 2htm | C | 1.66 | 0.95 | 0.05 | 2.6 | 1.9 |
| 1 | Rv2291 | Rv2291_model1.pdb | 2016-01-04 | easy | 3olh | A | 1.38 | 0.91 | 0.06 | 3.3 | 2.3 |
| 2 | Rv1559 | Rv1559_model1.pdb | 2016-01-08 | easy | 1tdj | A | 0.73 | 0.81 | 0.09 | 5.4 | 3.4 |
| 3 | Rv3113 | Rv3113_model1.pdb | 2015-12-30 | easy | 3sd7 | A | 0.72 | 0.81 | 0.09 | 4.1 | 2.8 |
| 4 | Rv2447c | Rv2447c_model1.pdb | 2016-01-08 | easy | 2vos | A | 0.07 | 0.72 | 0.11 | 7.1 | 4.2 |


In [25]:
my_gempro.genes.get_by_id('Rv0417').protein.structures

[<ITASSERProp P9WG73 at 0x7fbae40d6a20>]

In [26]:
my_gempro.genes.get_by_id('Rv0417').protein.structures[0].get_dict()

{'c_score': 1.66,
 'chains': [],
 'create_dfs': True,
 'description': None,
 'difficulty': 'easy',
 'file_type': 'pdb',
 'id': 'P9WG73',
 'is_experimental': False,
 'mapped_chains': [],
 'model_date': '2015-12-30',
 'model_file': 'Rv0417_model1.pdb',
 'model_to_use': 'model1',
 'reference_seq': <SeqProp P9WG73 at 0x7fbae41336d8>,
 'reference_seq_top_coverage': 0,
 'representative_chain': None,
 'results_path': '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/raw/P9WG73',
 'rmsd': 2.6,
 'rmsd_err': 1.9,
 'structure': None,
 'structure_file': 'Rv0417_model1.pdb',
 'structure_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/structures/by_gene/Rv0417/Rv0417_model1.pdb',
 'tm_score': 0.95,
 'tm_score_err': 0.05,
 'top_template_chain': 'C',
 'top_template_pdb': '2htm'}

## Setting a representative structure

Once you've mapped PDBs and homology models, you need to choose one representative structure per gene from all of the available ones.


- Method:
    - 


- Files downloaded:
    - 
    
    
- Usage:
        my_gempro.set_representative_structure(always_use_homology=True, 
                                               sort_homology_by='seq_coverage',
                                               allow_missing_on_termini=0.1, 
                                               allow_mutants=True, 
                                               allow_deletions=False,
                                               allow_insertions=False, 
                                               allow_unresolved=True, 
                                               force_rerun=False)
        
        
- Arguments:

    - *always_use_homology*
        - 
        
    - *sort_homology_by*
        - 
            

-------------------

### Gene annotations

Check out the gene annotations saved directly into the gene.

- Usage:
        my_gempro.genes.get_by_id('Rv1295').annotation['structure']['representative']

In [27]:
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()

INFO:ssbio.pipeline.gempro:Created representative structures dataframe. See the "df_representative_structures" attribute.





| | gene | id | is_experimental | reference_seq | reference_seq_top_coverage | structure_file |
|---|---|---|---|---|---|---|
| 0 | Rv0417 | P9WG73-X | False | P9WG73 | 100.0 | Rv0417_model1-X_clean.pdb |
| 1 | Rv2291 | P9WHF5-X | False | P9WHF5 | 100.0 | Rv2291_model1-X_clean.pdb |
| 2 | Rv1295 | 2d1f-A | True | Rv1295 | 96.9 | 2d1f-A_clean.pdb |
| 3 | Rv1559 | P9WG95-X | False | P9WG95 | 100.0 | Rv1559_model1-X_clean.pdb |
| 4 | Rv3113 | O05790-X | False | mtu:Rv3113 | 100.0 | Rv3113_model1-X_clean.pdb |


In [28]:
my_gempro.genes.get_by_id('Rv1295').protein.structures

[<PDBProp 2d1f at 0x7fbae4d95438>]

In [29]:
my_gempro.genes.get_by_id('Rv1295').protein.structures[0].get_dict(exclude_attributes=['structure'])

{'chains': [<ChainProp A at 0x7fba531ea128>, <ChainProp B at 0x7fba531ea0b8>],
 'date': None,
 'description': None,
 'experimental_method': 'X-ray diffraction',
 'file_type': 'cif',
 'id': '2d1f',
 'is_experimental': True,
 'mapped_chains': ['A', 'B'],
 'reference_seq': <SeqProp Rv1295 at 0x7fba531ea7b8>,
 'reference_seq_top_coverage': 96.9,
 'release_date': None,
 'representative_chain': <ChainProp A at 0x7fba531ea828>,
 'resolution': 2.5,
 'structure_file': '2d1f.cif',
 'structure_path': '/home/nathan/projects_unsynced/mtuberculosis_gp_atlas/structures/by_gene/Rv1295/2d1f.cif',
 'tax_id': 1773}

## Property calculations

In [30]:
my_gempro.calculate_sequence_properties()
my_gempro.df_sequence_properties.head()




KeyError: 'sequence'

In [22]:
my_gempro.calculate_residue_depth()


INFO:ssbio.pipeline.gempro:Completed calculations of residue depth





In [23]:
my_gempro.genes.get_by_id('Rv1295').annotation['sequence']['representative']['properties']

{'percent_acidic': 0.09167,
 'percent_aliphatic': 0.38333,
 'percent_aromatic': 0.06944,
 'percent_basic': 0.09722,
 'percent_charged': 0.18889,
 'percent_non-polar': 0.625,
 'percent_polar': 0.375,
 'percent_small': 0.61944,
 'percent_tiny': 0.38333}

In [24]:
my_gempro.genes.get_by_id('Rv1295').annotation['structure']['representative']['properties']

{}

## Saving your GEM-PRO

In [25]:
import cobra.io
cobra.io.save_json_model(my_gempro.model, op.join(my_gempro.model_dir, 'iNJ661_GP.json'))