# README

This notebook gives an example of how to run the GEM-PRO pipeline on a **single gene ID**, for simplicity.

### Installation

See: https://github.com/nmih/ssbio/blob/master/README.md
- If something isn't working, update the repository (`git pull`) before troubleshooting any further.

In [1]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO

  warn("No LP solvers found")


In [2]:
# Create logger
import logging
logger = logging.getLogger()

############# SET YOUR LOGGING LEVEL HERE #############
logger.setLevel(logging.INFO)
#######################################################
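
The log lines printed later in this notebook are prefixed with `INFO:ssbio.pipeline.gempro:`, which suggests the pipeline logs under a logger of that name. As a minimal sketch, assuming that name, the standard `logging` module lets you quiet only the pipeline while leaving the root logger at `INFO`:

```python
import logging

# Root logger stays at INFO, as in the cell above
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)

# Assumption: logger name inferred from the "INFO:ssbio.pipeline.gempro:" prefix
pipeline_logger = logging.getLogger("ssbio.pipeline.gempro")
pipeline_logger.setLevel(logging.WARNING)

# INFO messages from the pipeline logger are now suppressed
print(pipeline_logger.isEnabledFor(logging.INFO))
print(root_logger.isEnabledFor(logging.INFO))
```

Setting the level back to `logging.INFO` (or `logging.NOTSET`, to inherit from the root logger) restores the pipeline messages.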

In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Initialization of the project

Set these three things:

- GEM_NAME
    - Your project name
- ROOT_DIR
    - The directory where the GEM_NAME folder will be created
- GENES_AND_SEQUENCES
    - Your gene and its sequence in a dictionary
    
A directory named after your GEM_NAME will be created in ROOT_DIR, with the following folders:
```
    .
    ├── data  # where dataframes are stored
    ├── figures  # where figures are stored
    ├── model  # where SBML and GEM-PRO models are stored
    ├── notebooks  # location of any ipython notebooks for analyses
    ├── sequences  # sequences are stored here, in gene specific folders
    │   ├── <gene_id1>
    │   │   └── sequence.fasta
    │   └── ...
    └── structures
        └── by_gene  # structures are stored here, in gene specific folders
            └── <gene_id1>
                └── 1abc.pdb
```
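
For illustration only, the layout above can be reproduced with the standard library. This is a sketch of the folder structure, not ssbio's own code; `make_project_dirs` is a hypothetical helper, and a temporary directory stands in for ROOT_DIR:

```python
import os
import tempfile

def make_project_dirs(root_dir, gem_name):
    """Create the folder layout shown in the tree and return the project path.

    Hypothetical helper for illustration; GEMPRO creates its own folders.
    """
    base = os.path.join(root_dir, gem_name)
    subdirs = [
        "data",
        "figures",
        "model",
        "notebooks",
        "sequences",
        os.path.join("structures", "by_gene"),
    ]
    for sub in subdirs:
        # exist_ok=True makes the call safe to repeat
        os.makedirs(os.path.join(base, sub), exist_ok=True)
    return base

root = tempfile.mkdtemp()  # stand-in for ROOT_DIR
project = make_project_dirs(root, "tester")
print(sorted(os.listdir(project)))
```

Gene-specific folders under `sequences` and `structures/by_gene` are left out here, since they depend on the gene IDs you load.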

In [4]:
gene_id = 'SRR1753782_00918'
gene_seq = 'MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'
GENES_AND_SEQUENCES = {gene_id: gene_seq}

In [5]:
my_gempro = GEMPRO(gem_name='tester', root_dir='/home/nathan/Downloads/', genes_and_sequences=GENES_AND_SEQUENCES)

INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester/model: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester/data: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester/sequences: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester/structures: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester/structures/by_gene: created directory
INFO:ssbio.pipeline.gempro:/home/nathan/Downloads/tester: GEM-PRO project files location
INFO:ssbio.pipeline.gempro:Loaded in 1 sequences
INFO:ssbio.pipeline.gempro:1: number of genes


## Mapping sequence -> structure

In this notebook we just show how to:
- BLAST the sequence to the PDB

### blast_seqs_to_pdb
BLASTs the representative sequence against the entire PDB and returns significant hits. Note that hits may come from other organisms, which may not be ideal.
- all_genes
    - Set to True if you want all genes and their sequences BLASTed
    - Set to False if you only want to BLAST sequences that did not have any PDBs from the function map_uniprot_to_pdb
- seq_ident_cutoff
    - Sequence identity cutoff for BLAST hits, given as a decimal fraction (the call below uses 0.05)
- evalue
    - E-value significance threshold for BLAST hits
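
To make the two thresholds concrete, here is a minimal sketch of one way such filtering could work. The semantics (keep hits with identity at or above the cutoff and E-value at or below the threshold) are an assumption, not ssbio's confirmed behavior; `filter_hits` and the hit records are hypothetical, with field names borrowed from the `df_pdb_blast` columns shown in this notebook:

```python
def filter_hits(hits, seq_ident_cutoff=0.0, evalue=1e-5):
    """Keep hits with identity fraction >= cutoff and E-value <= threshold.

    Assumed semantics, for illustration only.
    """
    return [
        h for h in hits
        if h["hit_percent_ident"] >= seq_ident_cutoff
        and h["hit_evalue"] <= evalue
    ]

# Hypothetical hits; the IDs are placeholders, not real PDB entries
hits = [
    {"pdb_id": "hypo1", "hit_percent_ident": 0.96, "hit_evalue": 0.0},
    {"pdb_id": "hypo2", "hit_percent_ident": 0.03, "hit_evalue": 1e-8},  # identity too low
    {"pdb_id": "hypo3", "hit_percent_ident": 0.40, "hit_evalue": 0.5},   # not significant
]

kept = filter_hits(hits, seq_ident_cutoff=0.05, evalue=0.00001)
print([h["pdb_id"] for h in kept])  # ['hypo1']
```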

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('SRR1753782_00918').protein.structures

### DataFrames
    - my_gempro.df_pdb_blast

In [6]:
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.05, evalue=0.00001)
my_gempro.df_pdb_blast.head()

INFO:ssbio.pipeline.gempro:SRR1753782_00918: adding 28 PDBs from BLAST results
INFO:ssbio.pipeline.gempro:Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute.





| | gene | pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SRR1753782_00918 | 2zyd | A | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 1 | SRR1753782_00918 | 2zyd | B | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 2 | SRR1753782_00918 | 2zya | A | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 3 | SRR1753782_00918 | 2zya | B | 2319.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
| 4 | SRR1753782_00918 | 3fwn | A | 2312.0 | 0.0 | 0.987179 | 0.963675 | 451 | 462 |
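
Note that `hit_percent_ident` and `hit_percent_similar` hold fractions rather than percentages. The values in the rows above are consistent with dividing the counts by 468. Treating 468 as the query length is an assumption inferred from the numbers, not something the output states; the short check below only reproduces the arithmetic:

```python
# Assumption: 468 is the denominator that reproduces the fractions in the table
ASSUMED_LENGTH = 468

hit_num_ident = 451
hit_num_similar = 462

percent_ident = hit_num_ident / ASSUMED_LENGTH
percent_similar = hit_num_similar / ASSUMED_LENGTH

print(round(percent_ident, 6))    # 0.963675
print(round(percent_similar, 6))  # 0.987179
```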


## Ranking and downloading structures

### set_representative_structure
Ranks available structures, runs QC/QA, then downloads and cleans the final structure.

### pdb_downloader_and_metadata
Downloads all structures per gene. This also adds some additional metadata to the annotation for each PDB.
    

-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('SRR1753782_00918').protein.representative_structure

### DataFrames
    - my_gempro.df_pdb_metadata
    - my_gempro.df_representative_structures

In [7]:
my_gempro.set_representative_structure()

INFO:ssbio.pipeline.gempro:Created representative structures dataframe. See the "df_representative_structures" attribute.





## Comparing new sequences

You can load additional sequences into this gene's protein object and align them to the original sequence.

### load_manual_sequence
Loads a new sequence string to be compared to the original sequence.

### align_sequences_to_representative
Aligns all loaded sequences to the representative sequence.
    
-------------------

## Saved information:

### Gene annotations
    - my_gempro.genes.get_by_id('SRR1753782_00918').protein.representative_sequence.sequence_alignments

In [8]:
import os.path as op

In [9]:
my_gene = my_gempro.genes[0]
gene_dir = op.join(my_gempro.sequence_dir, my_gene.id)
my_protein = my_gene.protein

In [10]:
mutated_id = 'mutated'
mutated_seq = 'MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE'

In [11]:
my_protein.load_manual_sequence(ident=mutated_id, seq=mutated_seq, outdir=gene_dir)

<SeqProp mutated at 0x7f0a8bf3fe48>

In [12]:
my_protein.align_sequences_to_representative(outdir=gene_dir)

In [13]:
my_protein.representative_sequence.sequence_alignments[0].annotations

{'a_seq': 'SRR1753782_00918',
 'b_seq': 'mutated',
 'deletions': [],
 'insertions': [],
 'mutations': [('N', 17, 'P')],
 'percent_gaps': 0.0,
 'percent_identity': 99.8,
 'percent_similarity': 99.8,
 'score': 2381.0}
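
For a gapless alignment like this one, the `mutations` entry can be reproduced with a position-by-position comparison. The sketch below is not how ssbio computes it; `point_mutations` is a hypothetical helper, and only the first 21 residues of each sequence are used to keep the example short:

```python
def point_mutations(ref, alt):
    """Return (ref_residue, 1-based position, alt_residue) for each mismatch.

    Only valid for equal-length, gap-free sequences.
    """
    if len(ref) != len(alt):
        raise ValueError("sequences must have the same length")
    return [
        (a, pos, b)
        for pos, (a, b) in enumerate(zip(ref, alt), start=1)
        if a != b
    ]

# First 21 residues of the original and mutated sequences from this notebook
ref_prefix = "MSKQQIGVVGMAVMGRNLALN"
alt_prefix = "MSKQQIGVVGMAVMGRPLALN"

print(point_mutations(ref_prefix, alt_prefix))  # [('N', 17, 'P')]
```

Sequences with insertions or deletions need a real alignment first, which is what `align_sequences_to_representative` provides.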

In [14]:
print(str(my_protein.representative_sequence.sequence_alignments[0][0].seq))
print(str(my_protein.representative_sequence.sequence_alignments[0][1].seq))

MSKQQIGVVGMAVMGRNLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE
MSKQQIGVVGMAVMGRPLALNIESRGYTVSVFNRSREKTEEVIAENPGKKLVPYYTVKEFVESLETPRRILLMVKAGAGTDAAIDSLKPYLEKGDIIIDGGNTFFQDTIRRNRELSAEGFNFIGTGVSGGEEGALKGPSIMPGGQKDAYELVAPILTKIAAVAEDGEPCVTYIGADGAGHYVKMVHNGIEYGDMQLIAEAYSLLKGGLNLSNEELANTFTEWNNGELSSYLIDITKDIFTKKDEDGNYLVDVILDEAANKGTGKWTSQSALDLGEPLSLITESVFARYISSLKAQRVAASKVLSGPKAQPAGDKAEFIEKVRRALYLGKIVSYAQGFSQLRAASDEYHWDLNYGEIAKIFRAGCIIRAQFLQKITDAYAENADIANLLLAPYFKKIADEYQQALRDVVAYAVQNGIPVPTFSAAVAYYDSYRAAVLPANLIQAQRDYFGAHTYKRTDKEGIFHTEWLE


## Viewing structures

In [15]:
my_protein.representative_structure.view_structure()

In [16]:
my_protein.view_all_mutations(scale_range=(5,7))

