# GC-IGR Graphing Worksheet

## Purpose:

This Jupyter notebook helps you choose a genome, download the relevant files from NCBI, extract intergenic regions (IGRs), match IGRs to known Rfam annotations, and then use support vector machine (SVM) classifiers to select IGRs for further analysis.

### Necessary Imports and Configuration

In [1]:
%cd '/home/jovyan/work'

import sys
import os
import pandas as pd
import plotly as py
from sqlalchemy import or_
from ipywidgets import interactive
from src.visualization.visualize import (
    graph_genome,
    graph_layout,
    display_genome,
    prepare_selection,
    build_interactive_fn,
    save_selected_IGRs,
)
from src.data.rfam_db import rfam_session, Genome

py.io.orca.config.use_xvfb = True
pd.set_option('display.max_columns', 60)

/home/jovyan/work


## Step 1: Review Bacterial and Archaeal Genomes with Rfam Annotations and Select a UniProt ID (upid)

In [2]:
# Create a connection to local or remote Rfam Database
session = rfam_session()

# Get list of bacterial and archaeal genomes and save them in Genome DF
genome_query = session.query(Genome).filter(or_(Genome.kingdom=='archaea', Genome.kingdom=='bacteria'))

# If necessary, filter for completely assembled genomes.
genome_query = genome_query.filter(Genome.assembly_level == 'complete-genome')

# You can also search by name using any part of the scientific name, e.g. 'carnobacterium'.
#genome_query = genome_query.filter(Genome.scientific_name.like('%carnobacterium%'))

genome_list = genome_query.all()
# Load the results into a DataFrame before closing the session.
genome_df = pd.read_sql_query(genome_query.statement, genome_query.session.bind)
session.close()

# Display the genomes numbered 0-9 that match the above criteria.
genome_df.iloc[0:10]

Unnamed: 0,upid,assembly_acc,assembly_version,wgs_acc,wgs_version,assembly_name,assembly_level,study_ref,description,total_length,ungapped_length,circular,ncbi_id,scientific_name,common_name,kingdom,num_rfam_regions,num_families,is_reference,is_representative,created,updated
0,UP000000212,GCA_000317975.2,2,,,ASM31797v2,complete-genome,PRJEB544,ASM31797v2 assembly for Carnobacterium maltaro...,3650416,3650416,0.0,1234679,Carnobacterium maltaromaticum LMA28,,bacteria,134,40,0,1,2017-04-04 18:11:41,2019-03-27 14:12:41
1,UP000000229,GCA_000016565.1,1,,,ASM1656v1,complete-genome,PRJNA17457,ASM1656v1 assembly for Pseudomonas mendocina ymp,5072807,5072807,0.0,399739,Pseudomonas mendocina ymp,,bacteria,166,63,0,1,2017-04-04 18:31:16,2019-03-27 14:10:55
2,UP000000230,GCA_000016325.1,1,,,ASM1632v1,complete-genome,PRJNA17461,ASM1632v1 assembly for Enterobacter sp. 638,4676461,4676461,0.0,399742,Enterobacter sp. 638,,bacteria,322,114,0,1,2017-04-04 18:13:06,2019-03-27 14:10:55
3,UP000000231,GCA_000016345.1,1,,,ASM1634v1,complete-genome,PRJNA16679,ASM1634v1 assembly for Polynucleobacter asymbi...,2159490,2159490,0.0,312153,Polynucleobacter asymbioticus QLW-P1DMWA-1,,bacteria,55,21,0,1,2017-04-04 18:13:05,2019-03-27 14:10:33
4,UP000000233,GCA_000013785.1,1,,,ASM1378v1,complete-genome,PRJNA16817,ASM1378v1 assembly for Pseudomonas stutzeri A1501,4567418,4567418,0.0,379731,Pseudomonas stutzeri A1501,,bacteria,158,61,0,1,2017-04-04 18:13:11,2019-03-27 14:10:50
5,UP000000235,GCA_000016425.1,1,,,ASM1642v1,complete-genome,PRJNA16342,ASM1642v1 assembly for Salinispora tropica CNB...,5183331,5183331,0.0,369723,Salinispora tropica CNB-440,,bacteria,138,30,0,1,2017-04-04 18:13:12,2019-03-27 14:10:49
6,UP000000238,GCA_000012985.1,1,,,ASM1298v1,complete-genome,PRJNA16064,ASM1298v1 assembly for Hahella chejuensis KCTC...,7215267,7215267,0.0,349521,Hahella chejuensis KCTC 2396,,bacteria,123,28,0,1,2017-04-04 18:13:03,2019-03-27 14:10:46
7,UP000000239,GCA_000055785.1,1,,,ASM5578v1,complete-genome,PRJNA12636,ASM5578v1 assembly for Chromohalobacter salexi...,3696649,3696649,0.0,290398,Chromohalobacter salexigens DSM 3043,,bacteria,96,21,0,1,2017-04-04 18:13:04,2019-03-27 14:10:31
8,UP000000242,GCA_000016605.1,1,,,ASM1660v1,complete-genome,PRJNA17447,ASM1660v1 assembly for Metallosphaera sedula D...,2191517,2191517,0.0,399549,Metallosphaera sedula DSM 5348,,archaea,57,16,0,1,2017-04-04 18:28:10,2019-03-27 14:10:55
9,UP000000243,GCA_000014305.1,1,,,ASM1430v1,complete-genome,PRJNA17153,ASM1430v1 assembly for Streptococcus suis 05ZYH33,2096309,2096309,0.0,391295,Streptococcus suis 05ZYH33,,bacteria,108,37,0,1,2017-04-04 18:29:09,2019-03-27 14:10:53
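
The filtering logic in the cell above (kingdom, assembly level, and an optional partial-name match) can be sketched without a database connection. The following is a minimal illustration using Python's built-in `sqlite3` module, not the notebook's SQLAlchemy code. The `genome` table layout is a reduced stand-in, the first two rows are copied from the output above, and the `UP_EXAMPLE_*` rows and the `find_genomes` helper are hypothetical and exist only to exercise the filters.

```python
import sqlite3

# In-memory stand-in for the Rfam genome table (only the columns used here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE genome ("
    "upid TEXT, scientific_name TEXT, kingdom TEXT, assembly_level TEXT)"
)
rows = [
    # These two rows are copied from the DataFrame output above.
    ("UP000000212", "Carnobacterium maltaromaticum LMA28", "bacteria", "complete-genome"),
    ("UP000000242", "Metallosphaera sedula DSM 5348", "archaea", "complete-genome"),
    # Hypothetical placeholder rows, invented only to exercise the filters.
    ("UP_EXAMPLE_1", "Example eukaryote", "eukaryota", "complete-genome"),
    ("UP_EXAMPLE_2", "Example draft bacterium", "bacteria", "contig"),
]
conn.executemany("INSERT INTO genome VALUES (?, ?, ?, ?)", rows)


def find_genomes(conn, name_part=None):
    """Return (upid, scientific_name) for complete bacterial/archaeal genomes.

    If name_part is given, keep only names containing it (LIKE '%name_part%').
    """
    sql = (
        "SELECT upid, scientific_name FROM genome "
        "WHERE kingdom IN ('archaea', 'bacteria') "
        "AND assembly_level = 'complete-genome'"
    )
    params = []
    if name_part is not None:
        sql += " AND scientific_name LIKE ?"
        params.append(f"%{name_part}%")
    sql += " ORDER BY upid"
    return conn.execute(sql, params).fetchall()


all_hits = find_genomes(conn)
carno_hits = find_genomes(conn, "carnobacterium")
print(all_hits)
print(carno_hits)
```

SQLite's `LIKE` is case-insensitive for ASCII characters by default, which is why the lowercase search term matches the capitalized genus name in this sketch. Whether the same holds for the Rfam database depends on its collation settings.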


## Step 2: Enter the upid of the Genome of Interest and Graph the Genome

You can also look up the UniProt ID (upid) for a genome of interest here: http://rfam.xfam.org/search?q=entry_type:%22Genome%22


In [3]:
upid = 'UP000001174'  # Enter your UniProt ID (upid)
annotated_df, fig, layout, genome = display_genome(upid)
display(fig)

FigureWidget({
    'data': [{'hoverinfo': 'skip',
              'marker': {'color': 'rgba(192,192,192,0.9)', '…

## Step 3: SVM Selection of Genomic Regions Enriched for ncRNAs

If necessary, modify the SVM selection hyperparameters (class_weight_mod, gamma_exp, and c_exp) using the sliders below.


In [4]:
# Provides an interactive modification of selection parameters.
interactive_fn = build_interactive_fn(annotated_df, layout, genome)
interactive_plot = interactive(
    interactive_fn,
    class_weight_mod=(0.05, 2.0, 0.05),
    gamma_exp=(-5, 5, 0.25),
    c_exp=(-5, 5, 0.25),
)
interactive_plot

interactive(children=(FloatSlider(value=0.5, description='class_weight_mod', max=2.0, min=0.05, step=0.05), Fl…
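
The source of `build_interactive_fn` is not shown here, so how the three slider values reach the classifier is not confirmed. As a rough sketch of one plausible convention, the example below assumes that `gamma_exp` and `c_exp` are base-10 exponents for an RBF kernel's `gamma` and `C`, and that `class_weight_mod` is the weight given to the positive (known ncRNA) class. The `svm_kwargs` helper and the class labels `0` and `1` are hypothetical.

```python
def svm_kwargs(class_weight_mod=0.5, gamma_exp=0.0, c_exp=0.0):
    """Translate slider values into keyword arguments for an RBF-kernel SVM.

    Assumptions (not confirmed by the notebook): the exponents are base 10,
    and class_weight_mod is the weight of the positive class (label 1).
    """
    if class_weight_mod <= 0:
        raise ValueError("class_weight_mod must be positive")
    return {
        "kernel": "rbf",
        "C": 10.0 ** c_exp,          # regularization strength
        "gamma": 10.0 ** gamma_exp,  # RBF kernel width
        "class_weight": {0: 1.0, 1: class_weight_mod},
    }


# Example: the slider default for class_weight_mod (0.5) with gamma = 1e-2, C = 10.
params = svm_kwargs(class_weight_mod=0.5, gamma_exp=-2, c_exp=1)
print(params)
```

If scikit-learn is installed, `sklearn.svm.SVC(**params)` accepts these keyword arguments. Exposing exponents rather than raw values lets a linear slider cover several orders of magnitude, which is the usual way to search `C` and `gamma`.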

## Step 4: Finalize Selection and Build BLAST Script/Data Tarfile

After finalizing the selection in the interactive graph above, execute the following two blocks of code to extract the selected intergenic regions and prepare a tarfile containing the data and scripts necessary for BLAST analysis.

In [5]:
# Extract the values from the interactive plot, then save the selection, the genome graph, and the IGR FASTA files.
# Create the bash job and tarfile for BLAST.

save_selected_IGRs(interactive_plot, annotated_df, genome)

Number of known IGRs included:   31 (88.6%)
Number of unknown IGRs included: 108 (7.1%)
Fold Enrichment:  9.67
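
The internals of `save_selected_IGRs` are not shown, but the packaging step it describes (a FASTA file plus a bash job bundled into a tarfile) can be sketched with the standard library. Everything named below is an assumption for illustration: the `build_blast_bundle` helper, the file names, the placeholder sequence, and the `blastn` command line in the script are hypothetical and are not taken from the project.

```python
import tarfile
import tempfile
from pathlib import Path


def build_blast_bundle(workdir, fasta_text, script_text, bundle_name="blast_job.tar.gz"):
    """Write a FASTA file and a bash script, then pack both into a gzip tarfile."""
    workdir = Path(workdir)
    fasta_path = workdir / "selected_igrs.fasta"  # hypothetical file name
    script_path = workdir / "run_blast.sh"        # hypothetical file name
    fasta_path.write_text(fasta_text)
    script_path.write_text(script_text)
    script_path.chmod(0o755)  # make the job script executable

    bundle_path = workdir / bundle_name
    with tarfile.open(bundle_path, "w:gz") as tar:
        for path in (fasta_path, script_path):
            # arcname drops the directory so the archive unpacks flat.
            tar.add(path, arcname=path.name)
    return bundle_path


# Placeholder inputs, invented for this sketch.
fasta = ">igr_placeholder_1\nACGTACGTACGT\n"
script = (
    "#!/bin/bash\n"
    "blastn -query selected_igrs.fasta -db nt -out results.tsv -outfmt 6\n"
)

with tempfile.TemporaryDirectory() as tmp:
    bundle = build_blast_bundle(tmp, fasta, script)
    with tarfile.open(bundle) as tar:
        members = sorted(tar.getnames())

print(members)
```

Storing members under their bare file names keeps the archive independent of the directory it was built in, so the script can refer to the FASTA file by a relative path after unpacking on a cluster.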
