# Recon3D_GP - Loading and Exploring the GEM-PRO

This notebook guides you through loading the GEM-PRO model for **Recon3D_GP** and exploring its contents.

### Requirements:
- ``ssbio`` - [installation instructions](http://ssbio.readthedocs.io/en/latest/#installation), [documentation](http://ssbio.readthedocs.io/en/latest/index.html)
- ``nglview`` - used to display 3D structures inside the notebook

### Quick start:

##### Installation
```bash
pip install nglview
pip install ssbio
```

##### Running the notebook

1. Obtain one of these three items:
    - **A.** GitHub repository clone (`git clone https://github.com/SBRG/Recon3D`)
    - **B.** Lite GEM-PRO archive (`Recon3D_GP_archive-lite.tar.gz`)
    - **C.** GEM-PRO model (``Recon3D_GP.json.gz``)
1. If A: open this notebook and run it.
1. If B: extract the archive into the directory where this notebook is located.
1. If C: in the directory where this notebook is located, create a folder named ``Recon3D_GP/model/`` and place ``Recon3D_GP.json.gz`` in it.
1. Make sure your files are arranged like so:
```
.
├── Recon3D_GP
│   ├── data
│   ├── genes
│   ├── homology_models_raw
│   └── model
│       └── Recon3D_GP.json.gz
├── Recon3D_GP - Loading and Exploring the GEM-PRO.ipynb
└── Recon3D_GP - Updating the GEM-PRO.ipynb
```
1. Run this notebook!
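
Before running the cells below, the layout shown above can be checked programmatically. This is a minimal sketch using only the standard library; `missing_paths` and `EXPECTED_PATHS` are hypothetical names introduced for illustration, and the assumption is that the check runs from the notebook's directory. With the model-only download (option C), only the last entry is expected to exist.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the notebook folder.
EXPECTED_PATHS = [
    "Recon3D_GP/data",
    "Recon3D_GP/genes",
    "Recon3D_GP/homology_models_raw",
    "Recon3D_GP/model/Recon3D_GP.json.gz",
]


def missing_paths(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]


# Report anything missing relative to the current working directory.
for path in missing_paths():
    print("Missing:", path)
```

An empty report means every folder and the model file were found.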

## Loading the GEM-PRO

In [1]:
```python
# Loading the JSON file
# Change the location of the .json file if it is located somewhere else
from ssbio.core.io import load_json
Recon3D_GP = load_json('./Recon3D_GP/model/Recon3D_GP.json.gz', decompression=True)
```
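
If loading fails, it can help to confirm that the file itself is readable before suspecting ``ssbio``. The sketch below assumes the model is ordinary gzip-compressed JSON, which the `.json.gz` extension and `decompression=True` suggest but this notebook does not confirm. `peek_gzipped_json` is a hypothetical helper, not part of ``ssbio``, and it returns plain Python data rather than a GEM-PRO object.

```python
import gzip
import json


def peek_gzipped_json(path):
    """Load a gzip-compressed JSON file and describe its top-level object.

    Returns (type name, up to ten sorted top-level keys if it is a dict).
    Assumption: the file is plain gzip + JSON.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return type(data).__name__, sorted(data)[:10]
    return type(data).__name__, []


# Example call (path as used in the cell above):
# print(peek_gzipped_json('./Recon3D_GP/model/Recon3D_GP.json.gz'))
```

A decoding error at this stage would point to a truncated or corrupted download rather than to the loader.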



In [2]:
```python
# Alternative - loading the pickle file
# Uncomment and use this loading method if the JSON file fails to load
# from ssbio.core.io import load_pickle
# Recon3D_GP = load_pickle('./Recon3D_GP/model/Recon3D_GP.pckl')
```
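
The two loading methods above can also be tried in sequence instead of uncommenting code by hand. The following is a sketch of that idea, not something ``ssbio`` provides: `load_first_available` is a hypothetical helper that accepts any zero-argument callables, so it makes no assumptions about the loaders beyond the two calls already shown in this notebook.

```python
def load_first_available(loaders):
    """Try each (label, zero-argument callable) in order.

    Returns (label, result) for the first loader that succeeds and raises
    RuntimeError listing every failure if none of them does.
    """
    errors = []
    for label, loader in loaders:
        try:
            return label, loader()
        except Exception as exc:  # broad on purpose: any failure triggers the fallback
            errors.append(f"{label}: {exc!r}")
    raise RuntimeError("All loaders failed:\n" + "\n".join(errors))


# Usage with the calls from the two cells above (requires ssbio and the files):
# from ssbio.core.io import load_json, load_pickle
# label, Recon3D_GP = load_first_available([
#     ("json", lambda: load_json('./Recon3D_GP/model/Recon3D_GP.json.gz', decompression=True)),
#     ("pickle", lambda: load_pickle('./Recon3D_GP/model/Recon3D_GP.pckl')),
# ])
```

The returned label records which file was actually used, which is worth noting if the two files could differ.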

## Basic information & DataFrames

### Genes with and without structures

In [3]:
```python
# List all genes that have at least one experimental PDB structure
Recon3D_GP.genes_with_experimental_structures[:10]
```

```
[<GenePro 26.1 at 0x7f485a0314a8>,
 <GenePro 8639.1 at 0x7f485970e0b8>,
 <GenePro 10993.1 at 0x7f4859732160>,
 <GenePro 3939.1 at 0x7f48596d54e0>,
 <GenePro 3945.1 at 0x7f48596d5e10>,
 <GenePro 8050.1 at 0x7f48596bb5c0>,
 <GenePro 1738.1 at 0x7f48596bbef0>,
 <GenePro 4967.1 at 0x7f485965eeb8>,
 <GenePro 4967.2 at 0x7f485960b4a8>,
 <GenePro 131.1 at 0x7f485960f128>]
```

In [4]:
```python
# List all genes that have at least one homology model
Recon3D_GP.genes_with_homology_models[:10]
```

```
[<GenePro 1591.1 at 0x7f485970e470>,
 <GenePro 1594.1 at 0x7f4859728828>,
 <GenePro 10993.1 at 0x7f4859732160>,
 <GenePro 89874.1 at 0x7f4859743780>,
 <GenePro 160287.1 at 0x7f48596ca198>,
 <GenePro 92483.1 at 0x7f48596ca2b0>,
 <GenePro 3948.1 at 0x7f48596cab38>,
 <GenePro 3939.1 at 0x7f48596d54e0>,
 <GenePro 3945.1 at 0x7f48596d5e10>,
 <GenePro 9123.1 at 0x7f485969c240>]
```
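
The two lists overlap: a gene can have both experimental structures and homology models. The sketch below finds the overlap using the gene IDs copied by hand from the two truncated outputs above, so it runs without the model. `ids_in_both` is a hypothetical helper. With the loaded model, the ID lists could presumably be built as `[g.id for g in Recon3D_GP.genes_with_experimental_structures]`, assuming each gene object exposes an `id` attribute (which the use of `get_by_id` later in this notebook suggests).

```python
# Gene IDs copied from the two outputs above (first 10 entries of each list only).
experimental_ids = ["26.1", "8639.1", "10993.1", "3939.1", "3945.1",
                    "8050.1", "1738.1", "4967.1", "4967.2", "131.1"]
homology_ids = ["1591.1", "1594.1", "10993.1", "89874.1", "160287.1",
                "92483.1", "3948.1", "3939.1", "3945.1", "9123.1"]


def ids_in_both(first, second):
    """Return IDs present in both lists, preserving the order of `first`."""
    second_set = set(second)
    return [gene_id for gene_id in first if gene_id in second_set]


both = ids_in_both(experimental_ids, homology_ids)
print(both)  # ['10993.1', '3939.1', '3945.1']
```

Because only the first ten entries of each list are used here, the result understates the overlap in the full model.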

### Summary DataFrames

In [5]:
```python
# Summarize each protein
Recon3D_GP.df_proteins.head()
```

| gene | id | sequences | num_sequences | representative_sequence | num_structures | experimental_structures | num_experimental_structures | homology_models | num_homology_models | representative_structure | representative_chain | representative_chain_seq_coverage | num_sequence_alignments | num_structure_alignments |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10.1 | 10.1 | [P11245-1] | 1 | P11245-1 | 3 | [2pfr] | 1 | [H12520, ENSP00000286479] | 2 | 2pfr-A | A | 99.7 | 2 | 0 |
| 100.1 | 100.1 | [P00813-1] | 1 | P00813-1 | 22 | [3iar, 2bgn, 1w1i, 1qxl, 1krm, 2z7g, 2e1w, 1wx... | 20 | [H27942, ENSP00000361965] | 2 | 3iar-A | A | 98.6 | 1 | 0 |
| 10000.1 | 10000.1 | [Q9Y243-1] | 1 | Q9Y243-1 | 2 | [2x18] | 1 | [H03080] | 1 | H03080-X | X | 100.0 | 9 | 0 |
| 10005.1 | 10005.1 | [O14734-1] | 1 | O14734-1 | 2 | [] | 0 | [H28002, ENSP00000217455] | 2 | H28002-X | X | 100.0 | 1 | 0 |
| 10005.2 | 10005.2 | [] | 0 |  | 0 | [] | 0 | [] | 0 |  |  |  | 0 | 0 |
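
A summary like this is mostly useful for filtering. The sketch below rebuilds three columns of the preview by hand so that it runs without the model, then selects proteins covered only by homology models and proteins with no representative structure. It assumes pandas is installed and that missing values are stored as `None`/`NaN`; the variable names are illustrative. The same boolean masks should apply to `Recon3D_GP.df_proteins` if its columns match the preview.

```python
import pandas as pd

# Three columns of the df_proteins preview above, re-entered by hand for illustration.
preview = pd.DataFrame(
    {
        "num_experimental_structures": [1, 20, 1, 0, 0],
        "num_homology_models": [2, 2, 1, 2, 0],
        "representative_structure": ["2pfr-A", "3iar-A", "H03080-X", "H28002-X", None],
    },
    index=pd.Index(["10.1", "100.1", "10000.1", "10005.1", "10005.2"], name="gene"),
)

# Proteins with homology models but no experimental structure.
homology_only = preview[
    (preview["num_experimental_structures"] == 0)
    & (preview["num_homology_models"] > 0)
]

# Proteins without any representative structure.
no_structure = preview[preview["representative_structure"].isna()]

print(list(homology_only.index))  # ['10005.1']
print(list(no_structure.index))   # ['10005.2']
```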


In [6]:
```python
# Summarize the sequences mapped to each protein
Recon3D_GP.df_representative_sequences.head()
```

| gene | uniprot | kegg | num_pdbs | pdbs | seq_len | sequence_file | metadata_file |
|---|---|---|---|---|---|---|---|
| 10.1 | P11245-1 | hsa:10 | 1 | 2PFR | 290 | P11245-1.fasta | P11245-1.txt |
| 100.1 | P00813-1 | hsa:100 | 2 | 3IAR;1M7M | 363 | P00813-1.fasta | P00813-1.txt |
| 10000.1 | Q9Y243-1 | hsa:10000 | 1 | 2X18 | 479 | Q9Y243-1.fasta | Q9Y243-1.txt |
| 10005.1 | O14734-1 | hsa:10005 | 0 |  | 319 | O14734-1.fasta | O14734-1.txt |
| 10007.1 | P46926-1 | hsa:10007 | 1 | 1NE7 | 289 | P46926-1.fasta | P46926-1.txt |
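
In this preview the `pdbs` column holds PDB IDs as a single semicolon-separated, upper-case string (for example `3IAR;1M7M`), while the structure IDs elsewhere in this notebook are lower-case (`3iar`). A small sketch for turning such a value into a list follows; `split_pdb_ids` is a hypothetical helper, and the lower-casing is an assumption made only so the IDs match the structure IDs shown elsewhere.

```python
def split_pdb_ids(value, lowercase=True):
    """Split a string such as '3IAR;1M7M' into a list of PDB IDs.

    Non-string or blank input (None, NaN, '') gives an empty list.
    """
    if not isinstance(value, str) or not value.strip():
        return []
    ids = [part.strip() for part in value.split(";") if part.strip()]
    return [pdb_id.lower() for pdb_id in ids] if lowercase else ids


print(split_pdb_ids("3IAR;1M7M"))  # ['3iar', '1m7m']
print(split_pdb_ids(""))           # []
```

On a pandas column the helper can be applied element-wise, e.g. `df["pdbs"].apply(split_pdb_ids)`.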


In [7]:
```python
# Summarize the structures mapped to each protein
Recon3D_GP.df_representative_structures.head()
```

| gene | id | is_experimental | file_type | structure_file |
|---|---|---|---|---|
| 10.1 | 2pfr-A | True | pdb | 2pfr-A_clean.pdb |
| 100.1 | 3iar-A | True | pdb | 3iar-A_clean.pdb |
| 10000.1 | H03080-X | False | pdb | NP_005456.1_model1_clean-X_clean.pdb |
| 10005.1 | H28002-X | False | pdb | NP_005460.2_model1_clean-X_clean.pdb |
| 10007.1 | 1ne7-A | True | pdb | 1ne7-A_clean.pdb |
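
The representative structure IDs shown here appear to combine a structure ID and a chain ID joined by a hyphen: `2pfr-A` in this table corresponds to `representative_structure = 2pfr-A` and `representative_chain = A` in the protein summary. Assuming that convention holds for every row, the two parts can be separated as sketched below; `split_structure_chain` is a hypothetical helper.

```python
def split_structure_chain(representative_id):
    """Split an ID such as '2pfr-A' into ('2pfr', 'A').

    The split is made at the last hyphen, so structure IDs that themselves
    contain hyphens are kept intact.
    """
    structure_id, _, chain_id = representative_id.rpartition("-")
    if not structure_id or not chain_id:
        raise ValueError(f"No '<structure>-<chain>' pattern in {representative_id!r}")
    return structure_id, chain_id


print(split_structure_chain("2pfr-A"))    # ('2pfr', 'A')
print(split_structure_chain("H03080-X"))  # ('H03080', 'X')
```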


### Inspecting the content of one gene and its protein

In [8]:
```python
# Looking at the content stored per gene
my_protein = Recon3D_GP.genes.get_by_id('100.1').protein
my_protein
```

```
<Protein 100.1 at 0x7f4858d80518>
```

#### Protein sequences and structures

In [9]:
```python
my_protein.sequences
```

```
[<UniProtProp P00813-1 at 0x7f4858d1e4e0>]
```

In [10]:
```python
my_protein.structures
```

```
[<StructProp H27942 at 0x7f4858d15400>,
 <StructProp ENSP00000361965 at 0x7f4858d15470>,
 <PDBProp 3iar at 0x7f4858d0d438>,
 <PDBProp 2bgn at 0x7f4858d15438>,
 <PDBProp 1w1i at 0x7f4858d15668>,
 <PDBProp 1qxl at 0x7f4858d15978>,
 <PDBProp 1krm at 0x7f4858d15c88>,
 <PDBProp 2z7g at 0x7f4858d15da0>,
 <PDBProp 2e1w at 0x7f4858d15eb8>,
 <PDBProp 1wxz at 0x7f4858d15fd0>,
 <PDBProp 1wxy at 0x7f4858d1b128>,
 <PDBProp 1vfl at 0x7f4858d1b240>,
 <PDBProp 1v7a at 0x7f4858d1b358>,
 <PDBProp 1v79 at 0x7f4858d1b470>,
 <PDBProp 1uml at 0x7f4858d1b588>,
 <PDBProp 1o5r at 0x7f4858d1b6a0>,
 <PDBProp 1ndz at 0x7f4858d1b7b8>,
 <PDBProp 1ndy at 0x7f4858d1b8d0>,
 <PDBProp 1ndw at 0x7f4858d1b9e8>,
 <PDBProp 1ndv at 0x7f4858d1bb00>,
 <PDBProp 2ada at 0x7f4858d1bc18>,
 <PDBProp 3km8 at 0x7f4858d1bd30>]
```
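
The output above mixes two kinds of entries: `PDBProp` objects for experimental structures and `StructProp` objects for the two homology models. Counting entries by class name gives a quick consistency check against the protein summary for `100.1` (22 structures, 20 experimental, 2 homology models). The sketch below uses bare stand-in classes so it runs without ``ssbio``; they are hypothetical placeholders that only mimic the class names in the output, not the real ``ssbio`` classes.

```python
from collections import Counter


class StructProp:
    """Hypothetical stand-in mirroring the class name shown in the output."""

    def __init__(self, ident):
        self.id = ident


class PDBProp:
    """Hypothetical stand-in mirroring the class name shown in the output."""

    def __init__(self, ident):
        self.id = ident


# IDs copied from the output above.
homology_ids = ["H27942", "ENSP00000361965"]
pdb_ids = ["3iar", "2bgn", "1w1i", "1qxl", "1krm", "2z7g", "2e1w", "1wxz",
           "1wxy", "1vfl", "1v7a", "1v79", "1uml", "1o5r", "1ndz", "1ndy",
           "1ndw", "1ndv", "2ada", "3km8"]
structures = [StructProp(i) for i in homology_ids] + [PDBProp(i) for i in pdb_ids]

# Tally the entries by class name.
counts = Counter(type(s).__name__ for s in structures)
print(counts["PDBProp"], counts["StructProp"])  # 20 2
```

With the loaded model, the same tally should work on the real list: `Counter(type(s).__name__ for s in my_protein.structures)`.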

## Viewing 3D structures

In [11]:
```python
# Displaying a structure
my_protein.structures.get_by_id('3iar').view_structure(recolor=False)
```

In [12]:
```python
# Displaying the protein's representative structure (single chain)
my_protein.representative_structure.view_structure(recolor=False)
```